Harris defines the digital divide more broadly: it includes both limited access to information and communication technologies (ICT) and a deficit in the ability to use the information gained through access to ICTs1.
People were more likely to fall into this digital divide if they were poor, a racial minority, had limited education, had a disability, or lived in an area with low population density. The digital divide often exacerbated other problems, such as finding employment: people either could not search relevant job offers on the internet or did not get the job because they lacked ICT skills.
The question this chapter tries to answer is: “Which characteristics are associated with library use?”
10.3 Resources & Chapter Outline
10.3.1 Data, codebook, and R packages
Data, codebook, and R packages for learning about descriptive statistics
Follow the instructions in Box 10.1 to import and clean pew_libraries_2016_ch10.csv from https://edge.sagepub.com/harris1e or download from the original Internet data source and clean.
I am using the first option because importing and cleaning data files involves nothing new for me.
On my first import attempt it turned out that all the factor variables were imported as character variables, so I had to add the column specification col_types = "nffffffff".
Listing / Output 10.2: Skim raw data for chapter 10
I have used the {skimr} package instead of {tableone}. Plotting a histogram for age would not really be necessary to decide whether the mean or the median should be used: the tiny histogram at the right side of the age line already shows that age is not normally distributed. But for the sake of practice I will create the histogram in the next tab.
R Code 10.3 : Show age distribution
R Code 10.4 : Table of descriptive statistics
Code
tab_desc <- tableone::CreateTableOne(
  data = tbl10,
  strata = 'uses.lib',
  vars = c("age", "sex", "parent", "disabled",
           "ses", "raceth", "educ", "rurality")
)
print(tab_desc, nonnormal = 'age', showAllLevels = TRUE)
#> Stratified by uses.lib
#> level yes
#> n 792
#> age (median [IQR]) 49.00 [31.00, 62.00]
#> sex (%) female 438 (55.3)
#> male 354 (44.7)
#> parent (%) parent 222 (28.2)
#> not parent 566 (71.8)
#> disabled (%) no 679 (86.3)
#> yes 108 (13.7)
#> ses (%) medium 585 (73.9)
#> high 91 (11.5)
#> low 116 (14.6)
#> raceth (%) Non-Hispanic White 540 (75.6)
#> Hispanic 83 (11.6)
#> Non-Hispanic Black 91 (12.7)
#> educ (%) HS to 2-year degree 341 (43.1)
#> Four-year degree or more 382 (48.2)
#> < HS 69 ( 8.7)
#> rurality (%) suburban 196 (24.9)
#> rural 401 (51.0)
#> urban 189 (24.0)
#> Stratified by uses.lib
#> no p test
#> n 809
#> age (median [IQR]) 53.00 [35.00, 65.00] 0.001 nonnorm
#> sex (%) 330 (40.8) <0.001
#> 479 (59.2)
#> parent (%) 169 (20.9) 0.001
#> 639 (79.1)
#> disabled (%) 661 (82.0) 0.024
#> 145 (18.0)
#> ses (%) 612 (75.6) 0.088
#> 67 ( 8.3)
#> 130 (16.1)
#> raceth (%) 557 (74.6) 0.110
#> 111 (14.9)
#> 79 (10.6)
#> educ (%) 431 (53.3) <0.001
#> 276 (34.1)
#> 102 (12.6)
#> rurality (%) 159 (19.9) 0.002
#> 478 (59.7)
#> 164 (20.5)
Listing / Output 10.4: Descriptive statistics with bivariate tests using {tableone}
Remark 10.1. : Printing bivariate tests for all variables — a QRP
I do not feel comfortable using {tableone} to print descriptive statistics with bivariate tests for all variables. Besides the mentioned danger of a questionable research practice (QRP) in hunting for statistical significance, I would like to inspect the relationships more slowly and in more detail. At a minimum, one should examine plots of the bivariate correlations.
I have the same skepticism about the advice to “examine the frequencies and percentages in the table to identify some possible categories that may be driving the significant results for the bivariate tests”. I think one should in the first instance be guided by theoretical assumptions when deciding what to inspect in detail, an approach that is demonstrated quite well with Bayesian model design in “Statistical Rethinking” (McElreath 2020).
Listing / Output 10.5: Bivariate exploratory data analysis for variables of chapter 10
There are too many plots (variables) in this example. I could easily divide the variables into different batches, as demonstrated in Columns and Mapping, and inspect those results in more detail. But in order to get all variations I would have to plan the approach systematically, which destroys the advantage of automatic plotting.
Binary logistic regression follows a similar format and process as linear regression (Chapter 9), but the outcome or dependent variable is binary. Because the outcome is binary, that is, categorical with two categories, the model predicts the probability that a person is in one of the two categories. In this chapter we want to predict library use, uses.lib.
Because of the binary outcome, the linear regression model would not work, since it requires a continuous outcome. However, the linear regression statistical model can be transformed using the logit transformation in order to be useful for modeling binary outcomes.
10.5.1 Formula of the logistic model
Formula 10.1 : Formula for the statistical form of the logistic model
\[
\begin{align*}
p(y) = \frac{1}{1+e^{-(b_{0}+b_{1}x_{1}+b_{2}x_{2})}}
\end{align*}
\tag{10.1}\]
\(p(y)\): probability of the outcome (e.g., probability of library use)
\(b_{0}\): y-intercept
\(x_{1}\) and \(x_{2}\): predictors of the outcome (e.g., age, rurality)
\(b_{1}\) and \(b_{2}\): coefficients for \(x_{1}\) and \(x_{2}\)
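To make Formula 10.1 concrete, here is a minimal R sketch of the model written as a function; the coefficient values and inputs are made up for illustration and are not from the model estimated later in this chapter:

## Formula 10.1 as an R function (made-up coefficients for illustration)
b0 <- 0.4    # y-intercept
b1 <- -0.01  # coefficient for x1 (e.g., age)
b2 <- 0.3    # coefficient for x2 (e.g., a rurality dummy)

p_y <- function(x1, x2) {
  1 / (1 + exp(-(b0 + b1 * x1 + b2 * x2)))
}

p_y(x1 = 20, x2 = 1)  # predicted probability, here ≈ .62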
10.5.2 Logistic function
The logistic function has a sigmoid shape that stretches from \(–∞\) to \(∞\) on the x-axis and from \(0\) to \(1\) on the y-axis. The function can take any value along the x-axis and give the corresponding value between \(0\) and \(1\) on the y-axis.
\[
\begin{align*}
\sigma(t) = \frac{e^{t}}{1+e^{t}} = \frac{1}{1+e^{-t}}
\end{align*}
\]
\(t\): value along the \(x\)-axis of the function
\(\sigma\): value of \(y\) for a specific value of \(t\), or the probability of \(y\) given \(t\).
In the case of logistic regression, the value of \(t\) will be the right-hand side of the regression model, which looks something like \(β_{0} + β_{1}x\), where \(x\) is an independent variable, \(β_{1}\) is the coefficient (rather than slope) for that variable, and \(β_{0}\) is the constant (rather than \(y\)-intercept).
Substituting this regression model for \(t\) in the logistic function gives:
\[
\begin{align*}
p(x) = \frac{e^{\beta_{0}+\beta_{1}x}}{1+e^{\beta_{0}+\beta_{1}x}}
\end{align*}
\]
ggplot2::ggplot(data = data.frame(x = c(-5, 5)), ggplot2::aes(x)) +
  ggplot2::stat_function(
    fun = function(x) base::exp(x) / (1 + base::exp(x)),
    n = 100,
    linewidth = 1,
    ggplot2::aes(color = "Logistic function")
  ) +
  ggplot2::scale_x_continuous(labels = seq.int(10, 35, length.out = 5)) +
  ggplot2::scale_color_manual(name = "", values = "hotpink2") +
  ggplot2::labs(x = "Values of input", y = "Value of outcome")
Listing / Output 10.6: Shape of the logistic function
R Code 10.7 : Example of logistic function with data points
Code
## generate fake data frame
x1 = c(10.2, 13.0, 14.1, 14.3, 14.5, 14.7, 15.0, 15.5,
       16.1, 17.3, 19.0, 19.2, 19.8, 21.0, 26.5)
y1 = rep(0, 15)
x2 = c(17.8, 18.2, 19.0, 21.4, 21.5, 22.7, 24.0, 27.2, 31.0, 32.4, 33.8)
y2 = rep(1, 11)
tbl <- tibble::tibble(x = c(x1, x2), y = c(y1, y2)) |>
  dplyr::arrange(x)

## draw logistic function with faked data points
ggplot2::ggplot(
  data = tbl,
  ggplot2::aes(x = x, y = y, color = "Logistic function")
) +
  ggplot2::stat_smooth(
    data = tbl,
    formula = y ~ x,
    method = "glm",
    se = FALSE,
    method.args = list(family = binomial)
  ) +
  ggplot2::geom_point(ggplot2::aes(alpha = "Observation"), color = "grey41") +
  ggplot2::labs(x = "Values of input", y = "Value of outcome") +
  ggplot2::scale_color_manual(name = "", values = "hotpink2") +
  ggplot2::scale_alpha_manual(name = "", values = 0.5)
Listing / Output 10.7: Example of logistic function with data points
R Code 10.8 : Example of logistic function showing probability of outcome for x = 20
Code
## generate fake data frame
x1 = c(10.2, 13.0, 14.1, 14.3, 14.5, 14.7, 15.0, 15.5,
       16.1, 17.3, 19.0, 19.2, 19.8, 21.0, 26.5)
y1 = rep(0, 15)
x2 = c(17.8, 18.2, 19.0, 21.4, 21.5, 22.7, 24.0, 27.2, 31.0, 32.4, 33.8)
y2 = rep(1, 11)
tbl <- tibble::tibble(x = c(x1, x2), y = c(y1, y2)) |>
  dplyr::arrange(x)

## draw logistic function with faked data points
ggplot2::ggplot(
  data = tbl,
  ggplot2::aes(x = x, y = y, color = "Logistic function")
) +
  ggplot2::stat_smooth(
    data = tbl,
    formula = y ~ x,
    method = "glm",
    se = FALSE,
    method.args = list(family = binomial),
    ggplot2::aes(linetype = "predictor = 20 and\noutcome = .44 example")
  ) +
  ggplot2::geom_point(ggplot2::aes(alpha = "Observation"), color = "grey41") +
  ggplot2::geom_segment(
    x = 20, xend = 20, y = 0, yend = .44,
    color = "#1f6fca", linetype = "dashed"
  ) +
  ggplot2::geom_segment(
    x = 10, xend = 20, y = .44, yend = .44,
    color = "#1f6fca", linetype = "dashed"
  ) +
  ggplot2::labs(x = "Values of input", y = "Value of outcome") +
  ggplot2::scale_color_manual(name = "", values = "hotpink2") +
  ggplot2::scale_alpha_manual(name = "", values = 0.5) +
  ggplot2::scale_linetype_manual(name = "", values = c("dashed", "dashed")) +
  ggplot2::annotate("text", x = 10, y = .48, label = "0.44", color = "#1f6fca") +
  ggplot2::annotate("text", x = 20.5, y = -.02, label = "20", color = "#1f6fca")
Listing / Output 10.8: Example of logistic function showing a probability of outcome for \(x = 20\)
Let’s assume that the above graphic is a model predicting library use from age. Then we can interpret it as a 44% probability of library use for a 20-year-old. Since 44% is lower than 50%, the model predicts that the 20-year-old does not have the outcome. So, if the outcome is library use, the logistic model would predict that this 20-year-old is not a library user.
Formula 10.3 : Formula of odds related to probability
\[
\begin{align*}
odds(y) = \frac{p(y)}{1-p(y)} = e^{b_{0}+b_{1}x}
\end{align*}
\]
To be equivalent to the interpretation of the coefficients in linear regression, however, there is one more step. That is, what is the increase or decrease in the odds of the outcome with a one-unit increase in \(x\)?
\[
\begin{align*}
OR = \frac{e^{b_0+b_{1}(x+1)}}{e^{b_0+b_{1}x}}
\end{align*}
\tag{10.6}\]
Equation 10.6 shows that for every one-unit increase in the independent variable \(x\), the odds of the outcome increase or decrease by \(e^{b_1}\). Taking \(e\) to the power of \(b_1\) is referred to as exponentiating \(b_1.\) After a model is estimated, the analyst will usually exponentiate the b value(s) in order to report odds ratios describing the relationships between each predictor and the outcome.
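For example, using the age coefficient \(b_1 = -0.008838\) that will be estimated in Listing / Output 10.10 below, exponentiating reproduces the odds ratio shown in Listing / Output 10.11:

## exponentiating the age coefficient gives its odds ratio
base::exp(-0.008838)
#> [1] 0.991201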
10.6 Achievement 3: Interpreting a simple logistic regression
10.6.1 NHST
10.6.1.1 NHST Step 1
Write the null and alternate hypotheses:
Wording for H0 and HA
H0: The model is no better than the baseline at predicting library use.
HA: The model is better than the baseline at predicting library use.
10.6.1.2 NHST Step 2
Compute the test statistic.
The generalized linear model (GLM), estimated with the stats::glm() function, can be used to fit a binary logistic regression model. The model is called generalized because it starts with the basic linear model and generalizes it to other situations.
The stats::glm() function will treat the first category as the reference group (the group without the outcome) and the second category as the group with the outcome. To see the order of uses.lib, use base::levels() to show the levels in order, and use stats::relevel() to re-order them.
glue::glue("Order of uses.lib levels originally")base::levels(tbl10$uses.lib)tbl10.1<-tbl10|>dplyr::mutate(uses.lib =forcats::fct_rev(uses.lib))save_data_file("chap10", tbl10.1, "tbl10.1.rds")glue::glue("")glue::glue("---------------------------------------")glue::glue("Order of uses.lib levels reversed")base::levels(tbl10.1$uses.lib)
#> Order of uses.lib levels originally
#> [1] "yes" "no"
#>
#> ---------------------------------------
#> Order of uses.lib levels reversed
#> [1] "no" "yes"
Listing / Output 10.9: Reversing order of uses.lib
In contrast to the book, the order of my levels for the uses.lib data is reversed. Since the question is whether one uses the library, “yes” has to be the second category. To get the same order as in the book, I had to reverse the order. Instead of using stats::relevel() I reversed the order with forcats::fct_rev().
R Code 10.10 : Estimate library use model and print results
Code
glm10.1 <- glm(
  formula = uses.lib ~ age,
  data = tbl10.1,
  family = binomial(link = "logit")
)
save_data_file("chap10", glm10.1, "glm10.1.rds")
base::summary(glm10.1)
#>
#> Call:
#> glm(formula = uses.lib ~ age, family = binomial(link = "logit"),
#> data = tbl10.1)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.403785 0.142194 2.840 0.00452 **
#> age -0.008838 0.002697 -3.277 0.00105 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 2177.5 on 1570 degrees of freedom
#> Residual deviance: 2166.7 on 1569 degrees of freedom
#> (30 observations deleted due to missingness)
#> AIC: 2170.7
#>
#> Number of Fisher Scoring iterations: 3
Listing / Output 10.10: Estimate the simple library use model and print (= summarize) the results
Listing / Output 10.10 uses family = binomial(link = "logit") to specify the outcome variable. The family argument provides a description of the error distribution and the link function to be used in the model. This is a convenient way to specify the details of the model and to distinguish between the different types of generalized linear models that are appropriate for different kinds of outcome variables. In addition to the glm() help page you get additional information if you type ?family into the console.
Warning 10.1: Problem with stats::-prefix in front of glm() in later R code
Because of an error in the coding of the DescTools::PseudoR2() function, I had to delete the stats:: prefix in front of the glm() function in Listing / Output 10.10. (See Warning 10.2)
R Code 10.11 : Get model fit, model significance, and odds ratios
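The code chunk is not reproduced above. My assumption is that the output below comes from a call like the following sketch, using {odds.n.ends} with its default 0.5 threshold:
Code
## model fit, model significance, and odds ratios for glm10.1
odds.n.ends::odds.n.ends(mod = glm10.1)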
#> $`Logistic regression model significance`
#> Chi-squared d.f. p
#> 10.815 1.000 0.001
#>
#> $`Contingency tables (model fit): frequency predicted`
#> Number observed
#> Number predicted 1 0 Sum
#> 1 338 298 636
#> 0 435 500 935
#> Sum 773 798 1571
#>
#> $`Count R-squared (model fit): percent correctly predicted`
#> [1] 53.34182
#>
#> $`Model sensitivity`
#> [1] 0.4372574
#>
#> $`Model specificity`
#> [1] 0.6265664
#>
#> $`Predictor odds ratios and 95% CI`
#> OR 2.5 % 97.5 %
#> (Intercept) 1.497482 1.1339164 1.9804811
#> age 0.991201 0.9859589 0.9964415
Listing / Output 10.11: Model fit, model significance, and odds ratios
10.6.1.3 NHST Step 3
Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).
Example 10.4 : Chi-squared distribution with df = 1
R Code 10.12 : Theoretical chi-squared distribution with df = 1
Code
ggplot2::ggplot() +
  ggplot2::xlim(0, 12) +
  ggplot2::stat_function(
    fun = dchisq,
    args = list(df = 1),
    linewidth = 1,
    color = "hotpink2"
  ) +
  ggplot2::labs(x = "Chi-squared statistic", y = "Probability density")
Listing / Output 10.12: Theoretical chi-squared distribution with df = 1
Graphing the probability of the chi-squared distribution with 1 degree of freedom to show that the probability that the chi-squared would be 10.815 or higher — if the null hypothesis were true — is the very small area under the curve from 10.815 to the right.
To get a better look at the position \(x = 10.815\) let’s zoom into the critical region. See Listing / Output 10.13.
R Code 10.13 : Theoretical chi-squared distribution with df = 1, zoomed in on the critical region at χ² = 10.815
Code
## Define start of shade
x_shade = 10.815
df = 1
y_shade = stats::dchisq(x_shade, df)

## Define sequence of x-values
tib <- tibble::tibble(x = seq(9, 20, length.out = 300)) |>
  # Compute density values
  dplyr::mutate(y = stats::dchisq(x, df))

## Subset data for shaded area
shaded_area <- tib |>
  dplyr::filter(x >= x_shade) |>
  ## Necessary as starting point for y = 0!
  tibble::add_row(x = x_shade, y = 0, .before = 1)

tib |>
  ## Plot the chi-squared distribution: df = 1
  ggplot2::ggplot(ggplot2::aes(x = x, y = y)) +
  ggplot2::geom_line(linewidth = 1, color = "hotpink2") +
  ## Draw segment
  ggplot2::geom_segment(x = x_shade, y = 0, xend = x_shade, yend = y_shade) +
  ## Shade curve
  ggplot2::geom_polygon(
    data = shaded_area,
    fill = "lightblue",
    ggplot2::aes(x = x, y = y)
  ) +
  ggplot2::labs(x = "Chi-squared statistic", y = "Probability density") +
  ggplot2::annotate(
    geom = "text", x = 11.5, y = .00058,
    label = base::round(stats::dchisq(x_shade, df), 6)
  )
Listing / Output 10.13: Theoretical chi-squared distribution with df = 1, zoomed into the critical region
Here we see that the area under the curve to the right of the chi-squared statistic of 10.815 is tiny: the resulting p-value of about .001 is far smaller than the significance level of .05. (The annotated value of about .0005 is the density at x = 10.815, not the p-value itself.)
When I developed the plots above I wondered why the graph in the book is a rough line and why the sample size (without NA’s) of \(n = 1501\) was mentioned. I learned that my curve is the theoretical curve, whereas the book uses randomly generated values to get a rougher simulated curve that is closer to factual data.
R Code 10.14 : Randomized chi-squared distribution with df = 1, n = 1571
Code
set.seed(42)
n = 1571
df = 1
chisq_dist <- tibble::tibble(x = stats::rchisq(n, df))

chisq_dist |>
  ggplot2::ggplot(ggplot2::aes(x)) +
  ggplot2::geom_density(linewidth = 1, color = "hotpink2") +
  ggplot2::labs(x = "Chi-squared statistic", y = "Probability density")
Listing / Output 10.14: Randomized chi-squared distribution with df = 1, n = 1571
This is a randomized chi-squared distribution. I could not shade the region \(x > 10.815\) because there were only three rows with values greater than that.
TODO: I wonder if I could provide an empirical chi-squared distribution with the actual data.
TODO 10.1: How to plot an empirical chi-squared distribution with real data?
Resource 10.1 : Video about the chi-squared distribution
I found a lengthy video tutorial (19:31) using RStudio featuring the chi-squared distribution. You will see the pdf, cdf, quantiles, plotting, examples, independence, goodness of fit, and explanations. There is also an accompanying web page with the text and the code.
10.6.1.4 NHST Step 4
Conclude and write report.
Report 10.1
The chi-squared test statistic for a logistic regression model with age predicting library use had a p-value of .001. This p-value indicates there is a .1% chance of a chi-squared statistic this large or larger if the null hypothesis were true. The null hypothesis is therefore rejected in favor of the alternate hypothesis that the model is better than the baseline at predicting library use. A logistic regression model including age was statistically significantly better than a null model at predicting library use [\(χ^2(1) = 10.82; p = .001\)].
10.6.2 Interpreting odds ratio significance
Both base::summary() and odds.n.ends::odds.n.ends() include estimates and significance statistics for the age predictor. The odds.n.ends() output includes odds ratios, which are easier to interpret. Because the outcome is transformed by the logistic function, the coefficients from summary() are not easy to interpret directly.
Looking at Listing / Output 10.11 we see that age has an odds ratio of .99. This could be interpreted as, “The odds of library use decrease by 1% for every 1-year increase in age.” The 1% comes from subtracting the odds ratio of .99 from 1 and multiplying by 100 to convert it to a percent. This is a strategy to make interpreting odds ratios below 1 easier. Otherwise we would have to say: “The odds of library use are .99 times as high with every 1-year increase in age.”
When the confidence interval includes 1, the odds of the outcome are not statistically significantly different with a change in the independent variable.
The odds ratio for the age variable is less than 1, and the confidence interval does not include 1. Thus, the interpretation of this odds ratio would be as follows:
Report 10.2: Interpretation of age odds ratio
The odds of library use are 1% lower for every 1-year increase in age (\(OR = .99; 95% CI: .986–.996\)).
10.6.2.1 NHST Step 1
Write the null and alternate hypotheses:
Wording for H0 and HA
H0: Library use is not associated with age.
HA: Library use is associated with age.
10.6.2.2 NHST Step 2
Compute the test statistic.
The base::summary() function applied to a stats::glm() result includes a z-statistic comparing each coefficient estimate to zero. The z-score is a measure of how many standard deviations away a value is from a mean. For logistic regression, however, it is the coefficient divided by its standard error, not by a standard deviation, that follows a z-distribution. This z-statistic is the test statistic for the Wald test, which has the same purpose (but follows a different distribution) as the Wald test from Section 9.7.
In the case of the age variable, dividing the estimate of \(-0.008838\) by its standard error of \(0.002697\) gives a z-statistic of \(-3.276974\), which is well beyond the boundary of \(–1.96\) for statistical significance when \(\alpha\) is set at \(.05\).
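This calculation can be verified directly in R; the values are copied from Listing / Output 10.10:

## z-statistic and two-sided p-value by hand
z <- -0.008838 / 0.002697
z                    # ≈ -3.277
2 * stats::pnorm(z)  # ≈ .00105, matching Pr(>|z|) for age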
The z-distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
R Code 10.15 : z-distribution for sample size of 1571
Code
nhstplot::plotztest(z = -3.28, tails = "one", xmax = 5)
Listing / Output 10.15: z-distribution for sample size of 1571
10.6.2.3 NHST Step 3
Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).
The area under the curve to the left of the test statistic in Listing / Output 10.15 is very small. Given this, the p-value of .00105 in Listing / Output 10.10 from base::summary(glm10.1) makes sense: there is a .105% probability of a z-statistic at least this extreme if there were no relationship between age and library use. Thus, there is a statistically significant relationship between age and library use (z = –3.28; p = .001).
10.6.2.4 NHST Step 4
Conclude and write report.
The odds ratio for age is .99 with a 95% CI of .986–.996. The confidence interval shows the range where the odds ratio likely is in the population. Because the confidence interval does not include 1, this indicates that the odds ratio is statistically significantly different from 1. The interpretation would be as follows:
Report 10.3: Interpreting odds ratio significance
The null hypothesis of no relationship between library use and age is rejected. The odds of library use are 1% lower for every 1-year increase in age in the sample (OR = .99; 95% CI: .986–.996). The 95% confidence interval indicates that the odds of library use are .4%–1.4% lower with each 1-year increase in age in the population that the sample came from.
10.7 Achievement 4: Interpreting two measures of model fit
For linear regression, the \(R^2\) statistic measured how well the model fit the observed data by measuring how much of the variability in the outcome was explained by the model. The concept of variance is appropriate for continuous but not for categorical variables. There are several measures of categorical model fit, including the percent correctly predicted (sometimes called count \(R^2\)) and the adjusted count \(R^2\), which adjusts the count \(R^2\) for the number of people in the largest of the two categories of the outcome.
10.7.1 Percent correctly predicted
The percent correctly predicted by the model is computed using the predicted probabilities, or fitted values, for each of the observations and comparing these probabilities to the true value of the outcome.
Formula 10.4 : Percent correctly predicted or count \(R^2\)
\[
\begin{align*}
\text{count } R^2 = \frac{n_{correctly\ predicted}}{n_{total}}
\end{align*}
\tag{10.4}\]
For example, if a person in the data set were predicted to have a chance of more than 50% of library use, this would be transformed into a “yes” or “1” value of the outcome and then compared to the person’s actual library use. If the predicted value and the true value matched, this would be considered a correct prediction. The same holds, vice versa, for predicted values of less than 50%. The total number of people the model gets correct out of the total number of people in the data analyzed is the percent correctly predicted, or count \(R^2\).
The Listing / Output 10.11 includes a table showing how many observations were correctly predicted in each category of the outcome.
Formula 10.5 : Calculate the percent correctly predicted with the glm10.1 data
\[
\begin{align*}
\text{count } R^2 = \frac{338 + 500}{1571} = .533
\end{align*}
\tag{10.5}\]
10.7.2 Adjusted count \(R^2\)
An alternative measure of percent correctly predicted is the adjusted count \(R^2\). It adjusts the count \(R^2\) for the number of people in the largest of the two categories of the outcome.
The argument behind this adjustment is that a null model, or a model with no predictors, could get a good percent correctly predicted just by predicting that everyone was in the outcome category that had the bigger percentage of people in it.
For the library use data, the most common category is library nonuse (or 0), with 798 of the 1571 participants with complete data for the model. Without knowing anything about library use, you could predict that everyone in the data set was a nonuser and be right \(\frac{798}{1571}\) or 50.8% of the time. Using the age predictor, the model is right \(\frac{338+500}{1571}\) or 53% of the time. While this is not a huge increase, it did classify 40 additional people correctly compared to using the percentages in the outcome categories with no other information.
The model using age to predict library use was correct 53% of the time (Count \(R^2\) = .53). The adjusted count \(R^2\) would be \(\frac{338+500-798}{1571-798}\) or .05.
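The same arithmetic in R, using the counts from the contingency table in Listing / Output 10.11:

## count R-squared and adjusted count R-squared by hand
(338 + 500) / 1571                # ≈ .533
(338 + 500 - 798) / (1571 - 798)  # ≈ .052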
Report 10.4: Interpretation of adjusted \(R^2\) in glm10.1
There were 5% more correct predictions by the age model than by the baseline (Adjusted Count \(R^2\) = .05).
10.7.3 Pseudo-\(R^2\)
Pseudo-\(R^2\) is another measure that is reported relatively frequently. This measure generally tries to quantify the reduction in the lack of fit between a null and a full model. You can compute pseudo-\(R^2\) with the packages listed below.
Bullet List 10.1: Packages with functions for computing pseudo-\(R^2\)
with r2_mcfadden() from the {performance} package (see: Section A.63), or
with PseudoR2() from the {DescTools} package (see: Section A.15).
Caution 10.1: Interpreting pseudo-\(R^2\)
Even in Resource 10.2 not all of the different pseudo-\(R^2\) measures are explained. It is a complex issue, and I will have to spend more time studying the details.
Example 10.5 : Computing Pseudo \(R^2\) with different packages
#> Index Estimate
#> 1 Nagelkerke 0.009
#> 2 Hosmer and Lemeshow 0.005
#> 3 Cox and Snell 0.007
Listing / Output 10.20: Computing Pseudo-\(R^2\) with {pubh}
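For comparison, a sketch of the two approaches from Bullet List 10.1; the output above was produced with {pubh}, so these calls print in a different format:

## McFadden's pseudo R-squared with {performance}
performance::r2_mcfadden(glm10.1)

## several pseudo R-squared variants with {DescTools}
DescTools::PseudoR2(glm10.1, which = c("McFadden", "CoxSnell", "Nagelkerke"))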
10.7.4 Sensitivity and specificity
Sometimes it is useful to know whether the model is better at predicting people with the outcome or people without the outcome. The measures used for these two concepts are sensitivity and specificity.
Sensitivity determines the percentage of the 1s or “yes” values the model got correct, while specificity computes the percentage of 0s or “no” values the model got correct.
In Listing / Output 10.11, the sensitivity is 43.7% while the specificity is 62.7%. The model was better at predicting the “no” values than the “yes” values. These percentages could also be computed from the frequency table in the output: The model predicted 500 of the 798 people in the 0 category correctly (62.7%) and 338 of the 773 in the 1 category correctly (43.7%).
10.8 Achievement 5: Estimating a larger logistic regression model
10.8.1 Compute full logistic regression model
Example 10.6 : Compute full logistic regression model
R Code 10.21 : Compute full logistic regression model, first try (glm10.2)
Code
# estimate the full library use model and print results
glm10.2 <- glm(
  formula = uses.lib ~ age + sex + educ + parent + disabled +
    rurality + raceth + ses,
  data = tbl10.1,
  na.action = na.exclude,
  family = binomial("logit")
)
odds.n.ends::odds.n.ends(glm10.2)
Listing / Output 10.21: Full logistic regression model for library use data (glm10.2)
My values are quite different from those in the book. The reason is that my levels are ordered differently. I will reorder my levels so that I get the same result as in the book.
R Code 10.22 : Order levels to match the result from the book
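The code chunk is not reproduced here. A sketch of how the reordering could look with forcats::fct_relevel(); the target level orders are my reconstruction from the dummy-variable names in Listing / Output 10.30, not a verified copy of the book's code:
Code
tbl10.3 <- tbl10.1 |>
  dplyr::mutate(
    educ = forcats::fct_relevel(educ, "< HS",
                                "Four-year degree or more",
                                "HS to 2-year degree"),
    parent = forcats::fct_relevel(parent, "not parent", "parent"),
    disabled = forcats::fct_relevel(disabled, "no", "yes"),
    rurality = forcats::fct_relevel(rurality, "rural", "suburban", "urban"),
    raceth = forcats::fct_relevel(raceth, "Hispanic",
                                  "Non-Hispanic Black",
                                  "Non-Hispanic White"),
    ses = forcats::fct_relevel(ses, "high", "low", "medium")
  )
save_data_file("chap10", tbl10.3, "tbl10.3.rds")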
Listing / Output 10.22: Levels ordered to match the result from the book
Caution 10.2: Why are levels such as ses not ordered according to their intrinsic ordering?
I have ordered the levels to match the book’s output. But I do not understand why some of the variables (educ, rurality, or ses) are not ordered from high to low or vice versa. In the book’s order, the highest or lowest level always sits in the middle.
R Code 10.23 : Compute full logistic regression model, final version (glm10.3)
Code
# estimate the full library use model
# matching the level order from the book
# and print results with odds.n.ends()
glm10.3 <- glm(
  formula = uses.lib ~ age + sex + educ + parent + disabled +
    rurality + raceth + ses,
  data = tbl10.3,
  na.action = na.exclude,
  family = binomial("logit")
)
save_data_file("chap10", glm10.3, "glm10.3.rds")
odds.n.ends::odds.n.ends(glm10.3)
Listing / Output 10.23: Compute full logistic regression model, final version (glm10.3)
Caution 10.3: Using na.action = na.omit instead of na.action = na.exclude
To find influential values I tried several different packages. Using {glmtoolbox} I received an error message when I used na.action = na.exclude. In the help file for the influence() function I learned:
Cases omitted in the fit are omitted unless a na.action method was used (such as na.exclude) which restores them. … If a model has been fitted with na.action = na.exclude (see na.exclude), cases excluded in the fit are considered here.
This behavior seems strange, because I thought that na.exclude would exclude the NA values, not restore them. But when I changed to na.action = na.omit, I got an error message in Listing / Output 10.27: “res_std must be size 1601 or 1, not 1427.”
Finally I decided to stick with na.action = na.exclude (deleting this line completely gives the same result) and not to use the {glmtoolbox} package.
10.8.2 NHST Step 1
Write the null and alternate hypotheses:
Wording for H0 and HA
H0: A model containing age, sex, education, parent status, disability status, rurality, SES, and race-ethnicity is no better than the baseline at explaining library use.
HA: A model containing age, sex, education, parent status, disability status, rurality, SES, and race-ethnicity is better than the baseline at explaining library use.
10.8.3 NHST Step 2
Compute the test statistic.
The chi-squared test statistic of 94.74 was computed by the odds.n.ends() function.
R Code 10.24 : Visualizing a chi-squared distribution with n = 1427, df = 12 and test statistic of 94.74
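The code chunk is not shown. One way to draw this distribution is a sketch with {nhstplot}, which I also used for the z-distribution in Listing / Output 10.15:
Code
## chi-squared distribution with the test statistic marked
nhstplot::plotchisqtest(chisq = 94.74, df = 12)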
Listing / Output 10.24: Chi-squared distribution with n = 1427, df = 12 and a test statistic of 94.74
10.8.4 NHST Step 3
Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).
The probability that the chi-squared would be 94.74 if the full model were no better than the null model is shown as the area under the curve to the right starting at 94.74. This corresponds with a very small p-value < .001.
10.8.5 NHST Step 4
Conclude and write report.
Report 10.5
A logistic regression model containing age, sex, education, parental status, disability status, SES, race-ethnicity, and rurality was statistically significantly better than the baseline probability at predicting library use [\(χ^2(12) = 94.74; p < .001\)].
10.9 Achievement 6: Interpreting the results of a larger logistic regression model
10.9.1 Interpreting odds ratios
Listing / Output 10.23 includes the odds ratios, which are easier to interpret than the coefficients, together with the significance statistics for the predictors. When the confidence interval includes 1, the odds ratio could be 1, which indicates that it is not statistically significantly different from 1. If the odds ratio is greater than 1 and the confidence interval does not include 1, the odds ratio suggests a statistically significant increase in the odds of the outcome.
Factor variables with two categories are interpreted with respect to the reference group, that is, the group not shown in the output. Since a factor variable can only change by going from one group to the other, the odds ratio is not for a one-unit change in the predictor; it is the change in the odds of the outcome when moving from the reference group to the other group.
For the sex variable, male is the group shown in the output, so the odds ratio for sex is interpreted as follows: “Males have 51% lower odds of library use compared to females.” The 51% comes from subtracting the odds ratio of .49 from 1 (1 – .49 = .51) and multiplying by 100.
For factor variables with more than two categories, each odds ratio is interpreted with respect to the reference group for that variable. For educ, the group not shown is the < HS group, indicating that it is the reference group. The odds ratio for the Four-year degree or more group is 1.90, so the interpretation would be that individuals with a four-year degree or more have 1.90 times higher odds of library use compared to people with less than a high school education.
A list of the odds ratios where the confidence interval did not include 1:
age (OR = .99; 95% CI: .984–.996)
sexmale (OR = .49; 95% CI: .39–.61)
educFour year degree or more (OR = 1.90; 95% CI: 1.26–2.90)
racethNon-Hispanic Black (OR = 1.55; 95% CI: 1.002–2.417)
10.9.1.1 Interpreting significant odds ratios greater than 1
There were two significant predictors with odds ratios greater than 1.
Report 10.6: Interpreting educ and raceth as significant odds ratios greater than 1
People with a four-year degree or more education had 1.90 times higher odds of library use compared to people with less than a high school education (OR = 1.90; 95% CI: 1.26–2.90).
People who were non-Hispanic Black had 1.55 times higher odds of library use compared to people who were Hispanic (OR = 1.55; 95% CI: 1.002–2.417).
10.9.1.2 Interpreting non-significant odds ratios greater than 1
When the confidence interval includes 1, it is possible that the true value of the odds ratio is 1. Therefore the values would be reported without the interpretation of higher odds. For instance:
Report 10.7: Interpreting rurality as a non-significant odds ratio greater than 1
The odds of library use were not statistically significantly different for urban residents compared to rural residents (OR = 1.23; 95% CI: .93–1.63).
10.9.1.3 Interpreting significant odds ratios less than 1
The age variable and male sex both show significant odds ratios lower than 1.
Report 10.8: Interpreting age and sex as significant odds ratios less than 1
For age, the odds of library use are 1% lower for every 1-year increase in a person’s age (OR = .99; 95% CI: .984–.996).
For male sex, the reference group is female and the odds ratio is .49. Subtracting 1 – .49 results in .51, so the odds of library use are 51% lower for males compared to females (OR = .49; 95% CI: .39–.61).
10.9.1.4 Interpreting non-significant odds ratios less than 1
The low-SES category had a non-significant odds ratio of .93 (95% CI: .57–1.52). The interpretation would be as follows:
Report 10.9: Interpreting ses as a non-significant odds ratio less than 1
The odds of library use are not statistically significantly different for those with low SES compared to those in the reference group of high SES (OR = .93; 95% CI: .57–1.52).
10.9.2 Compute and interpret model fit
10.9.2.1 Count \(R^2\)
The model correctly predicted 378 of the 696 who use the library and 482 of the 731 who do not use the library. Overall, it was correct for \(\frac{378+482}{1427} = \frac{860}{1427}\) of the observations, or 60.3% of the time (count \(R^2 = .603\)). It was better at classifying those who do not use the library (specificity = 65.9%) than those who use the library (sensitivity = 54.3%).
10.10 Achievement 7: Checking model assumptions and diagnostics
There are three assumptions for logistic regression:
independence of observations,
linearity, and
no perfect multicollinearity.
10.10.1 Logistic regression assumptions
10.10.1.1 Independence of observations
The Pew Research Center conducted a phone survey where they selected a single person in a randomly selected household. The assumption is met.
10.10.1.2 Linearity
For logistic regression, the outcome variable is binary, so its relationship with another variable will never be linear. Instead of plotting the relationship of the outcome with each continuous predictor, linearity is tested by plotting the log-odds of the predicted probabilities for the outcome against each of the continuous predictors in the model. For example, are the predicted values equally accurate for people of a younger age compared to people of an older age?
R Code 10.25 : Checking linearity of the age variable for the model of library use glm10.3
Code
# make a variable of the log-odds of the predicted values
logit_use <- log(x = glm10.3$fitted.values / (1 - glm10.3$fitted.values))

# make a small data frame with the log-odds variable and the age predictor
tbl10_lin <- tibble::tibble(logit_use, age = glm10.3$model$age)

# create a plot (Figure 10.9)
tbl10_lin |>
  ggplot2::ggplot(ggplot2::aes(x = age, y = logit_use)) +
  ggplot2::geom_point(
    ggplot2::aes(size = "Observation"),
    color = "gray60", alpha = .6
  ) +
  ggplot2::geom_smooth(
    formula = y ~ s(x, bs = "cs"),
    method = "gam",
    se = FALSE,
    ggplot2::aes(color = "Loess curve")
  ) +
  ggplot2::geom_smooth(
    formula = y ~ x,
    method = lm,
    se = FALSE,
    ggplot2::aes(color = "linear")
  ) +
  ggplot2::labs(
    x = "Age in years",
    y = "Log-odds of library use predicted probability"
  ) +
  ggplot2::scale_color_manual(
    name = "Type of fit line",
    values = c("dodgerblue2", "deeppink")
  ) +
  ggplot2::scale_size_manual(values = 1.5, name = "")
Listing / Output 10.25: Checking linearity of the age variable for the model of library use glm10.3
The graph shows the Loess curve close to the linear fit line with the exception of the youngest ages.
It is up to the analyst to determine whether the actual relationship is close enough to linear to meet this assumption. If the assumption is not met, this variable might be removed from analysis, transformed, or recoded.
Spline regression is one way to deal with problems of linearity in linear and logistic regression.
It might be worth considering that the data frame includes people as young as 16 years old; it is possible that there are different predictors of library use before adulthood and restricting the age range of the data frame to adults over 18 could be another option for addressing this deviation from linearity in the youngest survey participants.
Note 10.1: Why older than 18 and not older than 25?
It seems to me that there is still a problem with people around 18 years old. The loess curve in Listing / Output 10.25 only gets better for people older than 25 years.
Generally I have an uneasy feeling about the sentence: “It is up to the analyst to determine whether the actual relationship is close enough to linear to meet this assumption.” For me there is too much subjectivity in this approach. As I already remarked in TODO 9.2, I have to look for tools, such as statistical tests, to reach a more objective decision on whether the assumption of linearity is met.
10.10.1.3 No perfect multicollinearity
The GVIF is similar to the VIF used for linear regression, but modified to account for the categorical outcome. It examines how well each predictor variable in the model is explained by the group of other predictor variables. If a predictor is well explained by the others, it is redundant and unnecessary. For the GVIF, often a threshold of \(GVIF^{\frac{1}{2 \times df}} < 2.5\) is used as a cutoff, with values of 2.5 or higher indicating a failed multicollinearity assumption.
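A sketch for computing the GVIF, assuming {car} is the package used here; for models with factor predictors, car::vif() returns the columns GVIF, Df, and GVIF^(1/(2*Df)):

## generalized variance inflation factors for the full model
car::vif(mod = glm10.3)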
Listing / Output 10.26: Compute and show GVIF values
None of the values in the right-hand column have a value of 2.5 or higher, so there is no discernable problem with multicollinearity.
10.10.1.4 Conclusion
Overall, the assumption checking revealed a possible problem with age as a predictor, mostly at the youngest ages. The other assumptions were met. It might be useful to restrict the age variable to adults or to transform the age variable.
10.10.2 Model diagnostics
10.10.2.1 Finding outliers with standardized residuals
As shown in Listing / Output 9.18, residuals are the distances between the predicted value of the outcome and the true value of the outcome for each person or observation in the data set.
These values are standardized by computing z-scores for each one so that they follow a z-distribution. Standardized residuals are z-scores for the residual values. Z-scores that are greater than 1.96 or less than –1.96 are about two standard deviations or more away from the mean of a measure. In this case, they are more than two standard deviations away from the mean residual value. Very large values of standardized residuals can indicate that the predicted value for an observation is far from the true value for that observation, indicating that an examination of that observation could be useful.
Example 10.7 : Standardize residuals and check whether z-scores are greater than 2
# get standardized residuals and add to data frame
tbl10.4 <- tbl10.3 |>
  dplyr::mutate(res_std = stats::rstandard(model = glm10.3)) |>
  tibble::rowid_to_column("ID")

# check the residuals for large values > 2
tbl10.4 |>
  tidyr::drop_na(res_std) |>
  dplyr::summarize(max_resid = max(abs(x = res_std)))
Listing / Output 10.27: Compute and check residuals
The maximum absolute value of any standardized residual was less than 1.96, so the standardized residuals did not reveal any outliers.
R Code 10.28 : Standardized residuals for library use model
Code
tbl10.4 |>
  # tidyr::drop_na(res_std) |>
  ggplot2::ggplot(ggplot2::aes(
    x = ID, y = res_std, group = "Library use"
  )) +
  ggplot2::geom_jitter(ggplot2::aes(color = uses.lib)) +
  ggplot2::labs(
    x = "Observation number",
    y = "Standardized residuals"
  ) +
  ggokabeito::scale_okabe_ito(
    name = "Library use",
    aesthetics = "color",
    order = c(1, 3),
    alpha = .6
  )
#> Warning: Removed 174 rows containing missing values or values outside the scale range
#> (`geom_point()`).
Listing / Output 10.28: Standardized residuals for library use model
This is my reproduction of the book’s Figure 10.10, using the colorblind-friendly Okabe-Ito scale. Most predicted probabilities were about one standard deviation above or below the mean predicted probability, so there were no outliers to examine.
10.10.3 dfbeta(s)
As I already mentioned in Warning 9.1, the book mixes dfbeta and dfbetas together. But these are two different measures:
dfbetas (plural) are the standardized values, with a cutoff of \(2 / \sqrt{n}\). Because of the standardization, dfbetas are the recommended measure to use.
dfbeta (singular) values depend on the scale of the data.
R Code 10.29 : Computing different influence measures
Code
## get influence statistics
infl10.3 <- stats::influence.measures(model = glm10.3)

## print influential data
glue::glue("The next two results are from 'example(influence.measure)'")
glue::glue("")
glue::glue("Print list of influential data")
which(apply(infl10.3$is.inf, 1, any))

## which observations 'are' influential
glue::glue("")
glue::glue("Which observations 'are' influential?")
summary(infl10.3)

## print all diagnostic data
glue::glue("")
glue::glue("Print a summary of all diagnostic data")
base::summary(infl10.3$infmat)
#> The next two results are from 'example(influence.measure)'
#>
#> Print list of influential data
#> named integer(0)
#>
#> Which observations 'are' influential?
#> Potentially influential observations of
#> glm(formula = uses.lib ~ age + sex + educ + parent + disabled + rurality + raceth + ses, family = binomial("logit"), data = tbl10.3, na.action = na.exclude) :
#> NONE
#> numeric(0)
#>
#> Print a summary of all diagnostic data
#> dfb.1_ dfb.age dfb.sxml
#> Min. :-0.1051922 Min. :-0.0767940 Min. :-0.039027
#> 1st Qu.:-0.0129510 1st Qu.:-0.0160893 1st Qu.:-0.023107
#> Median : 0.0000000 Median : 0.0000000 Median :-0.015882
#> Mean : 0.0001392 Mean :-0.0005355 Mean :-0.001121
#> 3rd Qu.: 0.0125725 3rd Qu.: 0.0142736 3rd Qu.: 0.024395
#> Max. : 0.1142539 Max. : 0.0825243 Max. : 0.055434
#>
#> dfb.edom dfb.et2d dfb.prnt
#> Min. :-0.1161072 Min. :-0.1179326 Min. :-0.073641
#> 1st Qu.:-0.0106307 1st Qu.:-0.0069438 1st Qu.:-0.013862
#> Median : 0.0000000 Median : 0.0000000 Median : 0.000000
#> Mean : 0.0005464 Mean : 0.0001351 Mean : 0.000333
#> 3rd Qu.: 0.0113543 3rd Qu.: 0.0068441 3rd Qu.: 0.014662
#> Max. : 0.0967673 Max. : 0.0949315 Max. : 0.069974
#>
#> dfb.dsbl dfb.rrltys dfb.rrltyr
#> Min. :-0.079962 Min. :-0.0733540 Min. :-0.0742958
#> 1st Qu.:-0.011309 1st Qu.:-0.0152557 1st Qu.:-0.0150995
#> Median : 0.000000 Median : 0.0000000 Median : 0.0000000
#> Mean :-0.000256 Mean : 0.0002054 Mean : 0.0002627
#> 3rd Qu.: 0.009065 3rd Qu.: 0.0139402 3rd Qu.: 0.0147617
#> Max. : 0.089810 Max. : 0.0741895 Max. : 0.0714801
#>
#> dfb.rN-B dfb.rN-W dfb.sslw
#> Min. :-0.0843655 Min. :-0.1059933 Min. :-9.111e-02
#> 1st Qu.:-0.0023347 1st Qu.:-0.0079223 1st Qu.:-9.360e-03
#> Median : 0.0000000 Median : 0.0000000 Median : 0.000e+00
#> Mean : 0.0003143 Mean : 0.0002498 Mean :-2.919e-05
#> 3rd Qu.: 0.0025706 3rd Qu.: 0.0083564 3rd Qu.: 8.515e-03
#> Max. : 0.0847660 Max. : 0.0939644 Max. : 9.547e-02
#>
#> dfb.ssmd dffit cov.r cook.d
#> Min. :-0.1118319 Min. :-0.17302 Min. :0.9952 Min. :0.00011
#> 1st Qu.:-0.0062815 1st Qu.:-0.09024 1st Qu.:1.0035 1st Qu.:0.00032
#> Median : 0.0000000 Median :-0.04160 Median :1.0079 Median :0.00057
#> Mean :-0.0001344 Mean :-0.00026 Mean :1.0082 Mean :0.00071
#> 3rd Qu.: 0.0073985 3rd Qu.: 0.08853 3rd Qu.:1.0127 3rd Qu.:0.00093
#> Max. : 0.1113670 Max. : 0.18890 Max. :1.0250 Max. :0.00353
#> NA's :174 NA's :174
#> hat
#> Min. :0.000000
#> 1st Qu.:0.004505
#> Median :0.007261
#> Mean :0.008120
#> 3rd Qu.:0.011625
#> Max. :0.022682
#>
Listing / Output 10.29: Computing different influence measures
Generally it turned out that there are no influential values! Otherwise I would have gotten an output in which the results for the affected variables were flagged with an asterisk.
I had a hard time understanding the output, because the names of the variables do not match the names in the tibble. After some trial and error I understood that I have to look for the variable name stripped of its vowels, followed by a combination of letters for the level:
dfb.1_: intercept
dfb.age: age
dfb.sxml: sex-male
dfb.prnt: parent-parent
dfb.dsbl: disabled-yes
dfb.rrltys: rurality-suburban
dfb.rrltyr: rurality-urban
dfb.sslw: ses-low
dfb.ssmd: ses-medium
dfb.edom: educ-Four-year degree or more
dfb.et2d: educ-HS to 2-year degree
dfb.rN-B: raceth-Non-Hispanic Black
dfb.rN-W: raceth-Non-Hispanic White
This list was confirmed after I computed dfbetas directly with Listing / Output 10.30 because I got better variable names for the results.
I added to Listing / Output 10.29 two measures provided by an additional example from the last line of the stats::lm.influence() help page (“For more ‘user level’ examples, use example(influence.measures)”), respectively by the examples in the stats::influence.measures() help page.
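The call that produces the summary below is presumably (my assumption) simply:

## dfbetas = standardized dfbeta values
base::summary(stats::dfbetas(glm10.3))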
#> (Intercept) age sexmale
#> Min. :-0.1051922 Min. :-0.0767940 Min. :-0.039027
#> 1st Qu.:-0.0129510 1st Qu.:-0.0160893 1st Qu.:-0.023107
#> Median : 0.0000000 Median : 0.0000000 Median :-0.015882
#> Mean : 0.0001392 Mean :-0.0005355 Mean :-0.001121
#> 3rd Qu.: 0.0125725 3rd Qu.: 0.0142736 3rd Qu.: 0.024395
#> Max. : 0.1142539 Max. : 0.0825243 Max. : 0.055434
#> educFour-year degree or more educHS to 2-year degree parentparent
#> Min. :-0.1161072 Min. :-0.1179326 Min. :-0.073641
#> 1st Qu.:-0.0106307 1st Qu.:-0.0069438 1st Qu.:-0.013862
#> Median : 0.0000000 Median : 0.0000000 Median : 0.000000
#> Mean : 0.0005464 Mean : 0.0001351 Mean : 0.000333
#> 3rd Qu.: 0.0113543 3rd Qu.: 0.0068441 3rd Qu.: 0.014662
#> Max. : 0.0967673 Max. : 0.0949315 Max. : 0.069974
#> disabledyes ruralitysuburban ruralityurban
#> Min. :-0.079962 Min. :-0.0733540 Min. :-0.0742958
#> 1st Qu.:-0.011309 1st Qu.:-0.0152557 1st Qu.:-0.0150995
#> Median : 0.000000 Median : 0.0000000 Median : 0.0000000
#> Mean :-0.000256 Mean : 0.0002054 Mean : 0.0002627
#> 3rd Qu.: 0.009065 3rd Qu.: 0.0139402 3rd Qu.: 0.0147617
#> Max. : 0.089810 Max. : 0.0741895 Max. : 0.0714801
#> racethNon-Hispanic Black racethNon-Hispanic White seslow
#> Min. :-0.0843655 Min. :-0.1059933 Min. :-9.111e-02
#> 1st Qu.:-0.0023347 1st Qu.:-0.0079223 1st Qu.:-9.360e-03
#> Median : 0.0000000 Median : 0.0000000 Median : 0.000e+00
#> Mean : 0.0003143 Mean : 0.0002498 Mean :-2.919e-05
#> 3rd Qu.: 0.0025706 3rd Qu.: 0.0083564 3rd Qu.: 8.515e-03
#> Max. : 0.0847660 Max. : 0.0939644 Max. : 9.547e-02
#> sesmedium
#> Min. :-0.1118319
#> 1st Qu.:-0.0062815
#> Median : 0.0000000
#> Mean :-0.0001344
#> 3rd Qu.: 0.0073985
#> Max. : 0.1113670
Listing / Output 10.30: Compute dfbetas
Besides the better variable names, it also turned out that the results in Listing / Output 10.29 are dfbetas, not dfbeta, which I computed in Listing / Output 10.31 to see the differences.
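Again my assumption for the call behind the following output:

## unstandardized dfbeta values, for comparison with the dfbetas above
base::summary(stats::dfbeta(glm10.3))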
#> (Intercept) age sexmale
#> Min. :-4.116e-02 Min. :-2.908e-04 Min. :-0.0050592
#> 1st Qu.:-5.068e-03 1st Qu.:-6.093e-05 1st Qu.:-0.0029956
#> Median : 0.000e+00 Median : 0.000e+00 Median :-0.0020589
#> Mean : 5.456e-05 Mean :-2.031e-06 Mean :-0.0001456
#> 3rd Qu.: 4.920e-03 3rd Qu.: 5.405e-05 3rd Qu.: 0.0031625
#> Max. : 4.471e-02 Max. : 3.124e-04 Max. : 0.0071843
#> educFour-year degree or more educHS to 2-year degree parentparent
#> Min. :-0.0284321 Min. :-2.685e-02 Min. :-1.148e-02
#> 1st Qu.:-0.0026041 1st Qu.:-1.582e-03 1st Qu.:-2.162e-03
#> Median : 0.0000000 Median : 0.000e+00 Median : 0.000e+00
#> Mean : 0.0001341 Mean : 3.084e-05 Mean : 5.204e-05
#> 3rd Qu.: 0.0027817 3rd Qu.: 1.559e-03 3rd Qu.: 2.287e-03
#> Max. : 0.0237047 Max. : 2.162e-02 Max. : 1.091e-02
#> disabledyes ruralitysuburban ruralityurban
#> Min. :-1.479e-02 Min. :-1.196e-02 Min. :-1.231e-02
#> 1st Qu.:-2.092e-03 1st Qu.:-2.488e-03 1st Qu.:-2.503e-03
#> Median : 0.000e+00 Median : 0.000e+00 Median : 0.000e+00
#> Mean :-4.745e-05 Mean : 3.356e-05 Mean : 4.363e-05
#> 3rd Qu.: 1.676e-03 3rd Qu.: 2.274e-03 3rd Qu.: 2.448e-03
#> Max. : 1.661e-02 Max. : 1.210e-02 Max. : 1.185e-02
#> racethNon-Hispanic Black racethNon-Hispanic White seslow
#> Min. :-2.185e-02 Min. :-2.160e-02 Min. :-2.612e-02
#> 1st Qu.:-6.046e-04 1st Qu.:-1.616e-03 1st Qu.:-2.683e-03
#> Median : 0.000e+00 Median : 0.000e+00 Median : 0.000e+00
#> Mean : 8.155e-05 Mean : 5.104e-05 Mean :-8.383e-06
#> 3rd Qu.: 6.658e-04 3rd Qu.: 1.704e-03 3rd Qu.: 2.441e-03
#> Max. : 2.195e-02 Max. : 1.916e-02 Max. : 2.736e-02
#> sesmedium
#> Min. :-2.538e-02
#> 1st Qu.:-1.426e-03
#> Median : 0.000e+00
#> Mean :-3.055e-05
#> 3rd Qu.: 1.679e-03
#> Max. : 2.527e-02
Listing / Output 10.31: Compute dfbeta
These dfbeta values are quite different from the dfbetas.
Warning 10.3: dfbetas suitable for Generalized Linear Models (GLMs)?
I am not sure whether the dfbeta values, besides being valid for linear models computed with stats::lm(), are also valid for generalized linear models computed with stats::glm(). My doubts are nurtured by the fact that in the help file both dfbeta and dfbetas require a method of class lm. This is in contrast to Cook’s D, where methods for both classes lm and glm exist.
See also the explanation of the argument `infl`:
influence structure as returned by lm.influence or influence (the latter only for the glm method of rstudent and cooks.distance).
10.10.4 Cook’s Distance
Cook’s Distance is computed in a similar way to dfbeta(s): each observation is removed and the model is re-estimated without it. Cook’s D then combines the differences between the models with and without an observation across all the parameters together, instead of one parameter at a time like the dfbeta(s). A high Cook’s D indicates that removing the observation made a big difference, so the observation might be considered influential.
Depending on whether I computed it with or without NA’s, I got 12 or 6 influential values. This is in contrast to Listing / Output 10.29, where I couldn’t detect any influential values.
But I cannot see which rows in the data these six values belong to, although that would be necessary to inspect and/or delete these values; see the sketch below.
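A sketch of how the flagged rows could be identified with base R functions, assuming the common cutoff of \(\frac{4}{n}\):

## Cook's D for every observation in the model
cooks_d <- stats::cooks.distance(model = glm10.3)
## cutoff of 4/n, with n = number of complete cases
cutoff <- 4 / sum(!is.na(cooks_d))
## named indices = row numbers of potentially influential observations
which(cooks_d > cutoff)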
10.10.5 Compute leverage
Leverage is the influence that the observed value of the outcome has on the predicted value of the outcome. Leverage values range between 0 and 1. To determine which leverage values indicate influential observations, a cutoff of \(\frac{2p}{n}\) is often used. In this case, we have 13 parameters, so the cutoff is \(\frac{2 \times 13}{1427} = 0.01822004\).
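Leverage values are the hat values of the model. A sketch for counting the observations above this cutoff (standard {stats} calls; the cutoff follows the formula above):

## leverage (hat) values compared against the 2p/n cutoff
lev <- stats::hatvalues(model = glm10.3)
sum(lev > (2 * 13) / 1427, na.rm = TRUE)  # number of high-leverage observations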
Instead of 35 values I got 95 influential values. The reason for the difference is that there are 1601 records, not 1427 as calculated in the book.
But all in all I still think that both results, 35 and 95, are wrong, because the variable infl10.3$is.inf does not contain a single TRUE value for any measure, as already shown in Listing / Output 10.29.
TODO: Look for other tutorials on how to compute influential values
I think that there are several reasons why the computation of the influential values in the book has some inaccuracies:
There is a misunderstanding regarding dfbeta and dfbetas. In the book both measures are treated the same, but there is a difference. (BTW: Both measures are written without a dash, in contrast to the book.)
In my opinion it is not established that all of the lm measures work the same with glm. In the help file there are several hints that differences exist.
The output of stats::influence.measures(), as explained in the examples of the help files, does not show any TRUEs for the diagnostic variable is.inf.
And last but not least: There is no clear advice on what to do when influential values are detected. The book says that one should inspect the values and, in the case of a coding error, correct or remove the data. But if there was no coding error, should I stick with the influential values? Even if I notice a pattern in the influential values, such as we saw with the distances in Texas in Chapter 9?
There are many tutorials explaining how to work with outliers and influential values. But these texts, often buried in textbooks, are complex. I need to dedicate some time to reading them and trying out their data examples. Given the importance of this subject, I will explore some of these tutorials.
TODO 10.2: Look for other tutorials on how to compute influential values
Remark 10.2. : Difference between outliers and influential points
Outliers are observed data points that are far from the least squares line. They have large “errors”, where the “error” or residual is the vertical distance from the line to the point.
Besides outliers, a sample may contain one or a few points that are called influential points. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.
R Code 10.34 : Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions
Code
## Now, do what you should have done in the first place: PLOTS
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for (i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
}

op <- par(
  mfrow = c(2, 2),
  mar = 0.1 + c(4, 4, 1, 1),
  oma = c(0, 0, 2, 0)
)
for (i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  plot(ff, data = anscombe, col = "red", pch = 21, bg = "orange",
       cex = 1.2, xlim = c(3, 19), ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets", outer = TRUE, cex = 1.5)
Listing / Output 10.34: Anscombe’s 4 Regression data sets
Listing / Output 10.34 shows the difference between outliers and influential points very clearly:
The point at the far right top of panel x4 is an outlier without any influence, because it lies on the regression line.
The point at the right top of panel x3 is an outlier and at the same time an influential point, because it changes the regression line.
10.11 Achievement 8: Using the model for predictions
Logistic regression models are useful not only for examining relationships between predictors and binary outcomes, but also for predicting probabilities for hypothetical or new cases that are not in the data frame.
R Code 10.35 : Using the model to predict probabilities for observations that are outside the data set
Code
# make a new data frame containing the observations of interest
glm10.3_example <- tibble::tibble(
  age = c(35, 65, 68),
  sex = c("male", "female", "male"),
  educ = c("Four-year degree or more",
           "Four-year degree or more",
           "Four-year degree or more"),
  disabled = c("no", "no", "no"),
  parent = c("not parent", "parent", "parent"),
  rurality = c("rural", "rural", "rural"),
  raceth = c("Non-Hispanic White",
             "Non-Hispanic White",
             "Non-Hispanic White"),
  ses = c("low", "medium", "medium")
)

# use the new data frame to predict
predictions <- predict(
  object = glm10.3,
  newdata = glm10.3_example,
  type = "response"
)
predictions
#> 1 2 3
#> 0.5135547 0.6466977 0.4648331
Listing / Output 10.35: Predict probabilities for observations that are outside the data set
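To turn these probabilities into predicted categories, one would apply the conventional 0.5 cutoff; a short sketch:

# classify the three hypothetical cases with a 0.5 cutoff
ifelse(predictions > 0.5, "library user", "nonuser")
# the first two cases are predicted to be library users, the third a nonuser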
10.12 Achievement 9: Adding and interpreting interaction terms
An interaction term examines how two (or more) variables might work together to influence an outcome. A possible interaction between sex and parental status can be included in the logistic regression model by adding a term, \(+ sex \times parent\), to the formula; in R's formula syntax, sex * parent expands automatically to the main effects plus the interaction, i.e., sex + parent + sex:parent. When an interaction is included in a model, it is customary to also include the interacting variables separately. In a model with interaction terms, the terms that are not part of the interaction are called main effects.
10.12.1 Computing interaction
R Code 10.36 : Computing the interaction of parental status and sex
Code
# estimate the library use model with the interaction term and print results
glm10.4 <- glm(
  formula = uses.lib ~ age + sex + educ + parent + disabled +
    rurality + ses + raceth + sex * parent,
  data = tbl10.4,
  family = binomial("logit")
)
odds.n.ends::odds.n.ends(mod = glm10.4)
Listing / Output 10.36: Generalized linear model glm10.4 with the interaction of parental status and sex
10.12.2 NHST Step 1
Write the null and alternate hypotheses:
Wording for H0 and HA
H0: A model containing age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is not useful in explaining library use.
HA: A model containing age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is useful in explaining library use.
10.12.3 NHST Step 2
Compute the test statistic.
The test statistic is \(χ^2 = 96.30\) with 13 degrees of freedom.
10.12.4 NHST Step 3
Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).
The p-value is less than .001.
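This probability can be computed directly from the chi-squared distribution; a sketch:

# upper-tail probability of the model chi-squared with 13 df
stats::pchisq(q = 96.30, df = 13, lower.tail = FALSE)
# returns a value far below .001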
10.12.5 NHST Step 4
Conclude and write report.
The null hypothesis is rejected. The model including age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is useful in explaining library use [\(χ^2(13) = 96.30; p < .001\)].
Report 10.10: Interpretation of glm10.4 for library use with the interaction of parent status and sex
Age, sex, having a four-year degree or more, and non-Hispanic Black race-ethnicity were statistically significant predictors of library use. For every 1-year increase in age, the odds of library use decreased by 1% (OR = .99; 95% CI: .983–.996). Males had 55% lower odds of library use compared to females (OR = .45; 95% CI: .35–.58), and those with a four-year degree had 1.93 times higher odds of library use compared to those with less than a high school education (OR = 1.93; 95% CI: 1.28–2.94). Finally, non-Hispanic Black participants had 1.56 times higher odds of library use compared to Hispanic participants (OR = 1.56; 95% CI: 1.01–2.43). Parent status, disability status, SES, and rurality were not significantly associated with library use, and those with a high school to 2-year degree had neither higher nor lower odds of library use than those with less education. Likewise, non-Hispanic White participants had neither higher nor lower odds of library use compared to Hispanic participants. There was no statistically significant interaction between sex and parent status on the odds of library use.
The model correctly predicted 383 of 696 library users and 467 of 731 library nonusers. Overall, the model correctly predicted 850 of 1427 observations (59.6%); the model was more specific (63.9%) than sensitive (55.0%), indicating that it was better at classifying library nonusers than library users.
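These percentages follow directly from the counts; a quick check:

383 / 696    # sensitivity: correctly predicted library users (~ .55)
467 / 731    # specificity: correctly predicted nonusers (~ .64)
850 / 1427   # percent correctly predicted, the count R-squared (~ .596)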
10.12.6 Check assumption
There is nothing new in this section.
10.13 Achievement 10: Likelihood ratio test
Given that the interaction term was not statistically significant and the model violated the linearity assumption, it seems preferable to report the previous model without the interaction term as the final model. However, there is a statistical test that can be used to determine whether a larger model is statistically significantly better than a smaller model. The test is called the likelihood ratio test.
The likelihood ratio (LR) test compares two nested models where one model includes a subset of the variables in the other model. For example, the simple logistic regression model with age as the only predictor could be compared statistically to any of the larger models because they all have the age variable in them. The small model is nested in each of the larger models. In addition, the larger model without the interaction could be compared to the model with the interaction term. Models where the variables of one are completely different from the variables of the other cannot be compared with this test.
The idea behind the LR test is to determine whether the additional variables improve the model enough to warrant the added complexity of a larger model.
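In formula form, the test statistic is \(LR = 2 \times (logLik_{larger} - logLik_{smaller})\), which follows a \(\chi^2\) distribution with degrees of freedom equal to the difference in the number of parameters estimated by the two models.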
10.13.1 NHST Step 1
Write the null and alternate hypotheses:
Wording for H0 and HA
H0: The larger model with the interaction term is no better at explaining library use compared to the model without the interaction term.
HA: The larger model with the interaction term is better at explaining library use compared to the model without the interaction term.
10.13.2 NHST Step 2
Compute the test statistic.
The {lmtest} package (Section A.45) has the lrtest() function, which can be used to compare two nested models. The LR test computes the difference between the log-likelihoods for the two models and multiplies this by 2; the result has a chi-squared distribution.
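The call that produces the output below would look like this (a sketch, with glm10.3 as the model without and glm10.4 as the model with the interaction term, as in the earlier listings):

# compare the two nested models with the likelihood ratio test
lmtest::lrtest(glm10.3, glm10.4)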
#> Likelihood ratio test
#>
#> Model 1: uses.lib ~ age + sex + educ + parent + disabled + rurality +
#> raceth + ses
#> Model 2: uses.lib ~ age + sex + educ + parent + disabled + rurality +
#> ses + raceth + sex * parent
#> #Df LogLik Df Chisq Pr(>Chisq)
#> 1 13 -941.32
#> 2 14 -940.54 1 1.5599 0.2117
Listing / Output 10.37: Likelihood ratio test
The output from lmtest::lrtest() shows the test statistic is \(χ^2 = 1.56\).
10.13.3 NHST Step 3
Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).
The test statistic of \(χ^2 = 1.56\) had a p-value of .21.
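Both numbers can be reproduced by hand from the log-likelihoods shown in Listing / Output 10.37:

# LR test statistic: twice the difference of the two log-likelihoods
2 * (-940.54 - (-941.32))                            # = 1.56
# p-value from the chi-squared distribution with 1 df
stats::pchisq(q = 1.56, df = 1, lower.tail = FALSE)  # about .21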
10.13.4 NHST Step 4
Conclude and write report.
Report 10.11: Report the result of the likelihood ratio test
The null hypothesis was retained; the model with the interaction term was no different in explaining library use from the model without the interaction term (\(χ^2(1) = 1.56; p = .21\)).
When the larger model is not statistically significantly better, the smaller model is preferred because it is easier to interpret. The more complex a model becomes, the more difficult it is to interpret; generally speaking, parsimony is preferable. However, there are exceptions, such as when the larger model contains variables that have been consistently related to the outcome in other research or that are important for understanding the outcome for some other reason.
10.13.5 Complete interpretation of the final model
Report 10.12: Interpretation of the final model
Model chosen
A logistic regression model with age, sex, education, parent status, disability status, rurality, socioeconomic status, and race-ethnicity was statistically significantly better than a baseline model at explaining library use [\(χ^2(12) = 94.74; p < .001\)].
A likelihood ratio test comparing this model to a model that also included an interaction between sex and parent status showed that the larger model was not statistically significantly better than the smaller model [\(χ^2(1) = 1.56; p = .21\)], so the smaller model was retained.
Odds of library use
The odds of library use were 51% lower for males compared to females (\(OR = .49; 95% CI: .39–.61\)).
The odds of library use were 1.90 times higher for those with a four-year degree compared to those with less than a high school education (\(OR = 1.90; 95% CI: 1.26–2.90\)).
The odds of library use were 1.55 times higher for non-Hispanic Black participants compared to Hispanic participants (\(OR = 1.55; 95% CI: 1.002–2.42\)).
The odds of library use were 1% lower for every 1-year increase in a person’s age (\(OR = .99; 95% CI: .984–.996\)).
The odds of library use were not statistically significantly different for urban or suburban residents compared to rural residents, for parents compared to nonparents, for non-Hispanic Whites compared to Hispanics, or for people with low or medium SES compared to those with high SES.
Checked assumptions and model diagnostics
Assumption checking revealed a possible problem with the linearity of the age predictor, especially at the youngest ages. The other assumptions were met.
Diagnostics found two problematic outlying or influential observations, but the observations did not appear to be data entry mistakes or much different from the rest of the sample in any way.
Example 10.9 : Forest plot for reporting odds ratios
R Code 10.38 : Forest plot for reporting odds ratios (Version 1)
Code
# get odds ratio table from glm10.3
# not [4] as in the book, but [6]
df10.3_odds <- data.frame(odds.n.ends::odds.n.ends(mod = glm10.3)[6])
#> Waiting for profiling to be done...
Code
# change variable names for easier use
names(df10.3_odds) <- c("OR", "lower", "upper")

# make row names a variable
tbl10.3_odds <- df10.3_odds |>
  tibble::rownames_to_column(var = "variable")

# forest plot of odds ratios from glm10.3 (Figure 10.15)
tbl10.3_odds |>
  ggplot2::ggplot(ggplot2::aes(
    x = variable, y = OR, ymin = lower, ymax = upper
  )) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(
    yintercept = 1, lty = 2, color = "deeppink", linewidth = 1
  ) +
  ggplot2::coord_flip() +
  ggplot2::labs(
    x = "Variable from library use model",
    y = "Odds ratio (95% CI)"
  )
Listing / Output 10.38: Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Version 1)
R Code 10.39 : Forest plot for reporting odds ratios (Version 2)
Code
# clean variable names for graph
tbl10.3_odds_clean <- tbl10.3_odds |>
  dplyr::mutate(
    variable = dplyr::recode(
      .x = variable,
      "sexmale" = "Male",
      "ruralityurban" = "Urban residence",
      "ruralitysuburban" = "Suburban residence",
      "parentparent" = "Parent",
      "educHS to 2-year degree" = "HS to 2-year degree",
      "educFour-year degree or more" = "Four-year degree or more",
      "disabledyes" = "Disabled",
      "age" = "Age",
      "seslow" = "Low socioeconomic status",
      "sesmedium" = "Medium socioeconomic status",
      "racethNon-Hispanic White" = "Non-Hispanic White",
      "racethNon-Hispanic Black" = "Non-Hispanic Black",
      "(Intercept)" = "Intercept"
    )
  )

# modify graph to include clean variable names (Figure 10.16)
# change scale of y-axis (flipped) to log scale for visualization
tbl10.3_odds_clean |>
  ggplot2::ggplot(ggplot2::aes(
    x = variable, y = OR, ymin = lower, ymax = upper
  )) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(
    yintercept = 1, lty = 2, color = "deeppink", linewidth = 1
  ) +
  ggplot2::scale_y_log10(
    breaks = c(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10),
    minor_breaks = NULL
  ) +
  ggplot2::coord_flip() +
  ggplot2::labs(
    x = "Variable from library use model",
    y = "Odds ratio (95% CI)"
  )
Listing / Output 10.39: Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Version 2)
R Code 10.40 : Forest plot for reporting odds ratios (Final version)
Code
# final version: sort the variables by odds ratio with reorder()
tbl10.3_odds_clean |>
  ggplot2::ggplot(ggplot2::aes(
    x = reorder(variable, OR), y = OR, ymin = lower, ymax = upper
  )) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(
    yintercept = 1, lty = 2, color = "deeppink", linewidth = 1
  ) +
  ggplot2::scale_y_log10(
    breaks = c(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10),
    minor_breaks = NULL
  ) +
  ggplot2::coord_flip() +
  ggplot2::labs(
    x = "Variable from library use model",
    y = "Odds ratio (95% CI)"
  )
Listing / Output 10.40: Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Final version)
10.14 Exercises (empty)
10.15 Glossary
Adjusted-count-R2
Adjusted count R-squared is a measure of model fit for binary logistic regression that adjusts the percent correctly predicted (or count R-squared) by the model for the number of people in the largest outcome category. (SwR, Glossary)
B-Spline
The term "spline" refers to a wide class of basis functions that are used in applications requiring data interpolation and/or smoothing. Splines are special function defined piecewise by polynomials, it results to a smooth function built out of smaller, component functions. In interpolating problems, spline interpolation is often preferred to polynomial interpolation because it yields similar results, even when using low degree polynomials, while avoiding big numbers and the problem of oscillation at the edges of intervals of higher degrees. The term "spline" comes from the flexible [spline](https://en.wikipedia.org/wiki/Flat_spline) devices used by shipbuilders and draftsmen to draw smooth shapes. They used a long, thin piece of wood or metal that could be anchored in a few places in order to aid drawing curves. There are many types of splines, especially the common-place 'B-splines': The 'B' stands for 'basis,' which just means 'component.' B-splines build up wiggly functions from simpler less-wiggly components. Those components are called basis functions. B-splines force you to make a number of choices t
Confidence Interval
A range of values, calculated from the sample observations, that is believed, with a particular probability, to contain the true parameter value. (Cambridge Dictionary of Statistics, 4th ed., p.98)
Cook’s D
Cook’s distance (often abbreviated Cook’s D) is used in regression analysis to find influential outliers in a set of predictor variables. It is a way to identify points that negatively affect the regression model. ([Statistics How To](https://www.statisticshowto.com/cooks-distance/))
Count-R-squared
Count R-squared, or percent correctly predicted, is a measure of model fit for a logistic regression model that is the proportion of the outcome values that were correctly predicted out of all observations modeled. (SwR, Glossary)
Cumulative Distribution Function
A cumulative distribution function (CDF) tells us the probability that a random variable takes on a value less than or equal to x. (<a href="https://www.statology.org/cdf-vs-pdf/">Statology</a>) It sums all parts of the distribution, replacing a lot of calculus work. The CDF takes in a value and returns the probability of getting that value or lower. (BF, Chap.13) A CDF is a hypothetical model of a distribution, whereas the ECDF models empirical (i.e., observed) data. (<a href="https://www.statisticshowto.com/empirical-distribution-function/">Statistics How To</a>)
dfbeta
dfbeta values are the effects on the coefficients of deleting each observation in turn. ("Companion to Applied Regression", car package, help file)
dfbetas
dfbetas values are the effects on the coefficients of deleting each observation in turn, standardized by a deleted estimate of the coefficient standard error. ("Companion to Applied Regression", car package, help file)
Digital Divide
The term "digital divide" refers to the gap between individuals, households, businesses and geographic areas at different socio-economic levels with regard to their opportunities to access information and communication technologies (ICTs). (<a href= "https://www.oecd-ilibrary.org/science-and-technology/understanding-the-digital-divide_236405667766">OECD Library</a>)
EDA
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling, and it thereby contrasts with traditional hypothesis testing. (<a href="https://en.wikipedia.org/wiki/Exploratory_data_analysis">Wikipedia</a>)
GLM
A generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. (<a href="https://en.wikipedia.org/w/index.php?title=Generalized_linear_model&oldid=1175448680">Wikipedia</a>) The binary logistic regression model is one example of a generalized linear model. (SwR, Glossary)
Goodness-of-fit
The chi-squared goodness-of-fit test is used for comparing the values of a single categorical variable to values from a hypothesized or population variable. The goodness-of-fit test is often used when trying to determine if a sample is a good representation of the population. (SwR, Chap 5)
GVIF
The generalized variance inflation factor (GVIF) is a generalized version of the variance inflation factor (VIF) that is used to identify problems with multicollinearity; the GVIF is used in binary logistic regression. (SwR, Glossary)
Influential Observation
An influential observation is an observation that changes the slope of a regression line. (SwR, Glossary) Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly. (<a href= "https://openstax.org/books/introductory-statistics/pages/12-6-outliers">Introductory Statistics 12.6</a>)
Likelihood-ratio
The likelihood ratio test is a test that compares two nested binary logistic regression models to determine which is a better fit to the data; the difference between two log-likelihoods follows a chi-squared distribution with a significant result indicating the larger model is a better fitting model. (SwR, Glossary)
Linear Regression
Linear regression is used to predict the value of an outcome variable Y based on one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the value of the response Y when only the predictor (X) values are known. ([r-statistics.co](https://r-statistics.co/Linear-Regression.html)) (Chap.4)
Linearity
Linearity is the assumption of some statistical models that requires the outcome, or transformed outcome, to have a linear relationship with numeric predictors, where linear relationships are relationships that are evenly distributed around a line. (SwR, Glossary)
Logit Transformations
Logit transformations are transformations that take the log of p/(1-p); this transformation is often used to normalize percentage data and is used in the logistic model to transform the outcome. (SwR, Glossary)
Odds Ratio
Odds is usually defined in statistics as the probability an event will occur divided by the probability that it will not occur. An odds ratio (OR) is a measure of association between a certain property A and a second property B in a population. Specifically, it tells you how the presence or absence of property A has an effect on the presence or absence of property B. (<a href="https://www.statisticshowto.com/probability-and-statistics/probability-main-index/odds-ratio/">Statistics How To</a>). An odds ratio is a ratio of two ratios. They quantify the strength of the relationship between two conditions. They indicate how likely an outcome is to occur in one context relative to another. (<a href="https://statisticsbyjim.com/probability/odds-ratio/">Statistics by Jim</a>)
Outcome
Outcome is the variable being explained or predicted by a model; in linear and logistic regression, the outcome variable is on the left-hand side of the equal sign. (SwR, Glossary)
Outliers
Outliers are observations with unusual values. (SwR, Glossary). Outliers are observed data points that are far from the least squares line. They have large "errors", where the "error" or residual is the vertical distance from the line to the point. (<a href= "https://openstax.org/books/introductory-statistics/pages/12-6-outliers">Introductory Statistics 12.6</a>)
p-value
The p-value is the probability that the test statistic is at least as big as it is under the null hypothesis (SwR, Glossary)
PDF
A probability density function (PDF) describes a probability distribution for a random, continuous variable. Use a probability density function to find the chances that the value of a random variable will occur within a range of values that you specify. More specifically, a PDF is a function where its integral for an interval provides the probability of a value occurring in that interval. (<a href="https://statisticsbyjim.com/probability/probability-density-function/">Statistics By Jim</a>)
Predictor Variable
Predictor variable -- also known sometimes as the independent or explanatory variable -- is the counterpart to the response or dependent variable. Predictor variables are used to make predictions for dependent variables. ([DeepAI](https://deepai.org/machine-learning-glossary-and-terms/predictor-variable), [MiniTab](https://support.minitab.com/en-us/minitab/21/help-and-how-to/statistical-modeling/regression/supporting-topics/basics/what-are-response-and-predictor-variables/))
QRP
Questionable Research Practice (QRP) is a research practice that introduces bias, usually in pursuit of statistical significance; an example of such practices might be dropping or recoding values or variables solely to improve a model fit statistic. (SwR, Glossary)
Quantile
Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities (<a href="https://en.wikipedia.org/wiki/Quantile">Wikipedia</a>)
Residuals
Residuals are the differences between the observed values and the predicted values. (SwR, Glossary)
Sensitivity
The true positive rate (one minus the false negative rate) is referred to as sensitivity, recall, or probability of detection. (Bayesian Thinking, Chap.1) It is also the percentage of “yes” values or 1s a logistic regression model got right. (SwR, Glossary)
Sigmoid
A sigmoid function is any mathematical function whose graph has a characteristic S-shaped curve or sigmoid curve. (<a href="https://en.wikipedia.org/wiki/Sigmoid_function">Wikipedia</a>)
Specificity
The true negative rate (one minus the false positive rate) is referred to as specificity. (Bayesian Thinking, Chap.1) It is also the percentage of “no” values or 0s a logistic regression model predicted correctly. (SwR, Glossary)
Standard Deviation
The standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the values are spread out over a wider range. The standard deviation is the square root of its variance. A useful property of the standard deviation is that, unlike the variance, it is expressed in the same unit as the data. Standard deviation may be abbreviated SD, and is most commonly represented in mathematical texts and equations by the lower case Greek letter $\sigma$ (sigma), for the population standard deviation, or the Latin letter $s$ for the sample standard deviation. ([Wikipedia](https://en.wikipedia.org/wiki/Standard_deviation))
Standard Error
The standard error (SE) of a statistic is the standard deviation of its [sampling distribution]. If the statistic is the sample mean, it is called the standard error of the mean (SEM). (<a href="https://en.wikipedia.org/wiki/Standard_error">Wikipedia</a>) The standard error is a measure of variability that estimates how much variability there is in a population based on the variability in the sample and the size of the sample. (SwR, Glossary)
VIF
The variance inflation factor (VIF) is a statistic for determining whether there is problematic multicollinearity in a linear regression model. (SwR, Glossary)
Wald
The Wald test is the statistical test for comparing the value of a coefficient in linear or logistic regression to the hypothesized value of zero; the form is similar to a one-sample t-test, although some Wald tests use a t-statistic and others use a z-statistic as the test statistic. (SwR, Glossary)
Z-score
A z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it’s a measure of how many standard deviations below or above the population mean a raw score is. (<a href="https://www.statisticshowto.com/probability-and-statistics/z-score/#Whatisazscore">StatisticsHowTo</a>)
# Binary logistic regression {#sec-chap10}```{r}#| label: setup#| include: falseoptions(warn =0) # default value: change for debugging. See: ?warningbase::source(file ="R/helper.R")ggplot2::theme_set(ggplot2::theme_bw()) ```## Achievements to unlock::: {#obj-chap10}::: my-objectives::: my-objectives-headerObjectives for chapter 10:::::: my-objectives-container**SwR Achievements**- **Achievement 1**: Using exploratory data analysis before developing a logistic regression models (@sec-chap10-achievement1)- **Achievement 2**: Understanding the binary logistic regression statistical model (@sec-chap10-achievement2)- **Achievement 3**: Estimating a simple logistic regression model and interpreting predictor significance and interpretation (@sec-chap10-achievement3)- **Achievement 4**: Computing and interpreting two measures of model fit (@sec-chap10-achievement4)- **Achievement 5**: Estimating a larger logistic regression model with categorical and continuous predictors (@sec-chap10-achievement5)- **Achievement 6**: Interpreting the results of a larger logistic regression model (@sec-chap10-achievement6)- **Achievement 7**: Checking logistic regression assumptions and using diagnostics to identify outliers and influential values (@sec-chap10-achievement7)- **Achievement 8**: Using the model to predict probabilities for observations that are outside the data set (@sec-chap10-achievement8)- **Achievement 9**: Adding and interpreting interaction terms in logistic regression (@sec-chap10-achievement9)- **Achievement 10**: Using the likelihood ratio test to compare two nested logistic regression models (@sec-chap10-achievement10)::::::Achievements for chapter 10:::## The perplexing libraries problemHarris defines `r glossary("digital divide")` broader and includes both limited accessto information and communication technologies (ICT) and a deficit in theability to use information gained through access toICTs[^10-logistic-regression-1]. [^10-logistic-regression-1]: My current research indicates that mostly only the first problem (limited access) falls under the definition ([@wikipedia2024; @oecd2001].People were more likely to fall into this digital divide if they were poor, a racial minority, had limited education, had a disability, or lived in an area with low population density. The digital divide often exacerbated other problems like finding an employment either by not searching relevant offers using the internet or not getting the job because of missing ICT skills.The question this chapter tries to answer: "Which characteristics are associated with library use?"## Resources & Chapter Outline### Data, codebook, and R packages {#sec-chap10-data-codebook-packages}::: my-resource::: my-resource-headerData, codebook, and R packages for learning about descriptive statistics:::::: my-resource-container**Data**Two options for accessing the data:1. Download the cleaned data set `pew_libraries_2016_cleaned_ch10.csv` from <https://edge.sagepub.com/harris1e>. 2. Follow the instructions in Box 10.1 to import and clean `pew_libraries_2016_ch10.csv` from <https://edge.sagepub.com/harris1e> or download from the original Internet data source and clean.I am using the first option because there is nothing new for me to import and clean data files.**Codebook**Two options:1. Download the `pew_libraries_2016_codebook_ch10.docx` codebook file from <https://edge.sagepub.com/harris1e>.2. 
Use the version that comes with the raw data file from Pew Research Center (https://www.pewinternet.org/dataset/march2016-libraries/)**Packages**1. Packages used with the book (sorted alphabetically)- {**car**}: @sec-car (John Fox)- {**lmtest**}: @sec-lmtest (Achim Zeileis) - {**odds.n.ends**}: @sec-odds.n.ends (Jenine Harris) - {**tableone**}: @sec-tableone (Kazuki Yoshida) - {**tidyverse**}: @sec-tidyverse (Hadley Wickham)2. My additional packages (sorted alphabetically)- {**skimr**}: @sec-skimr (Elin Waring)::::::## Achievement 1: EDA {#sec-chap10-achievement1}### Get, show, and recode data:::::{.my-example}:::{.my-example-header}:::::: {#exm-chap09-eda}: EDA: Get, show and recode data:::::::::::::{.my-example-container}::: {.panel-tabset}###### Get data:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-get-data}: Import data from .csv file:::::::::::::{.my-r-code-container}::: {#lst-chap10-get-data}```{r}#| label: get-data#| eval: false## run only once (manually)tbl10 <- readr::read_csv("data/chap10/pew_libraries_2016_cleaned_ch10.csv",col_types ="nffffffff")save_data_file("chap10", tbl10, "tbl10.rds")```Get data for chapter 10:::(*For this R code chunk is no output available*)***In my first import trial it turned out that all the factor variables are imported as character variables. So I had to add the columns specifications `col_types = "nffffffff"`.:::::::::###### Show data:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-show-data}: Show raw data for chapter 10:::::::::::::{.my-r-code-container}::: {#lst-chap10-show-data} ```{r}#| label: show-datatbl10 <- base::readRDS("data/chap10/tbl10.rds")skimr::skim(tbl10)```Skim raw data for chapter 10:::***I have used the {**skimr**} package instead of {**tableone**}. It wouldn't be necessary to plot a histogram for `age` to decide if the mean or median has to be used. The tiny histogram at the right side of the `age` line already shows that age is not normally distributed. But for the sake of practice I will create the histogram in the next tab.:::::::::###### age:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-age-hist}: Show age distribution:::::::::::::{.my-r-code-container}::: {#lst-chap10-age-dist}```{r}#| label: age-distmy_hist_dnorm(tbl10, tbl10$age)```The distribution of age in the 2016 Pew Research Center library use data set::::::::::::###### {tableone}:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-desc-stats}: Table of descriptive statistics:::::::::::::{.my-r-code-container}::: {#lst-chap10-desc-stats}```{r}#| label: desc-stats#| cache: truetab_desc <- tableone::CreateTableOne(data = tbl10,strata ='uses.lib',vars =c("age", "sex", "parent", "disabled","ses", "raceth", "educ", "rurality"))print(tab_desc,nonnormal ='age',showAllLevels =TRUE)```Descriptive statistics with bivariate tests using {**tableone**}:::***:::::{.my-remark}:::{.my-remark-header}:::::: {#rem-chap10-test-all-variables-together}: Printing bivariate tests for all variables --- a `r glossary("QRP")`:::::::::::::{.my-remark-container}I am not feeling comfortable to use {**tableone**} to print descriptive statistics with bivariate test for all variables. Besides the mentioned danger of a `r glossary("QRP", "questionable research practice")` (QRP) in looking for statistically significance I would like to inspect the relationships more slowly and to see more details. I think at a minimum one should examine plots of the bivariate correlations. 
I have the same skepticism about the advice to "examine the frequencies and percentages in the table to identify some possible categories that may be driving the significant results for the bivariate tests". I think one should be guided to inspect more in detail in the first instance by theoretical assumptions, an approach that is fairly well demonstrated with Bayesian model design in "Statistical Rethinking" [@mcelreath2020]. To facilitate `r glossary("EDA", "exploratory data analysis")` one could use packages that combine the analysis of different variables in one go, as my experiments with `GGally::ggpairs()` in @lst-chap09-plot-ggpairs or with `ggfortify::autoplot()` in @lst-chap09-test-ggfortify have shown. But this approach gets too overwhelming when there are more than 5-6 variables as I will demonstrate in @lst-chap10-bivariate-data with {**GGally**).::::::::::::::::::###### {GGally}:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-bivariate-eda}: Bivariate exploratory data analysis for variables of chapter 10:::::::::::::{.my-r-code-container}::: {#lst-chap10-bivariate-data}```{r}#| label: plot-ggpairs#| fig-height: 10#| fig-width: 10#| cache: true#| warning: falsetbl10 |> GGally::ggpairs()```Bivariate exploratory data analysis for variables of chapter 10:::*** There are too many plots (variables) in this example. I could divide easily the amount of variables into different patches as demonstrated in [Columns and Mapping](https://ggobi.github.io/ggally/articles/ggpairs.html#columns-and-mapping) and inspect these results in more details. But in order to get all variations I have to plan the approach systematically which destroys the advantage of automatic plotting.:::::::::::::::::::::## Achievement 2: Understanding binary logistic regression {#sec-chap10-achievement2}Binary logistic regression follows a similar format and process as linear regression (@sec-chap09), but the outcome or dependent variable is binary. Because the outcome is binary, or categorical consisting of two categories, the model predicts the probability that a person is in one of the two categories. In this chapter we want to predict the library use `uses.lib`.Because of the binary outcome the linear regression model would not work since it requires a continuous outcome. However, the `r glossary("linear regression")` statistical model can be transformed using `r glossary("logit transformations")` in order to be useful for modeling binary outcomes.### Formula of the logistic model:::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-logistic-model}: Formula for the statistical form of the logistic model:::::::::::::{.my-theorem-container}$$\begin{align*}p(y) = \frac{1}{1+e^{-(b_{0}+b_{1}x_{1}+b_{2}x_{2})}}\end{align*}$$ {#eq-chap10-logistic-model}***- $y$: binary outcome variable (e.g., library use) - $p(y)$: probability of the outcome (e.g., probability of library use) - $b_{0}$: y-intercept - $x_{1}$ and $x_{2}$: predictors of the outcome (e.g., age, rurality) - $b_{1}$ and $b_{2}$: coefficients for $x_{1}$ and $x_{2}$:::::::::### Logistic functionThe logistic function has a `r glossary("sigmoid")` shape that stretches from $–∞$ to $∞$ on the x-axis and from $0$ to $1$ on the y-axis. 
The function can take any value along the x-axis and give the corresponding value between $0$ and $1$ on the y-axis.:::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-logistic-function}: Formula of the logistic function:::::::::::::{.my-theorem-container}$$\begin{align*}\sigma(t) &= \frac{e^t}{1+e^t} =\\&= \frac{1}{1 + e^{-t}}\end{align*}$$ {#eq-chap10-logistic-function}***$t$: value along the $x$-axis of the function$\sigma$: value of $y$ for a specific value of $t$, or the probability of $y$ given $t$.In the case of logistic regression, the value of $t$ will be the right-hand side of the regression model, which looks something like $β_{0} + β_{1}x$, where $x$ is an independent variable, $β_{1}$ is the coefficient (rather than slope) for that variable, and $β_{0}$ is the constant (rather than $y$-intercept).Substituting this regression model for $t$ in the logistic function:$$\begin{align*}p(y) = \frac{1}{1 + e^{-(β_{0} + β_{1}x)}}\end{align*}$$ {#eq-chap10-logistic-function}::::::::::::::{.my-example}:::{.my-example-header}:::::: {#exm-chap10-logistic-function}: Drawing shape of logistic function empty and with example data points:::::::::::::{.my-example-container}::: {.panel-tabset}###### Shape:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-logistic-function-shape}: Shape of the logistic function:::::::::::::{.my-r-code-container}::: {#lst-chap10-logistic-function-shape}```{r}#| label: logistic-function-shapeggplot2::ggplot(data =data.frame(x =c(-5, 5)), ggplot2::aes(x)) + ggplot2::stat_function(fun =function(x) base::exp(x)/(1+ base::exp(x)), n =100,linewidth =1, ggplot2::aes(color ="Logistic function") ) + ggplot2::scale_x_continuous(labels =seq.int(10, 35, length.out =5) ) + ggplot2::scale_color_manual(name ="",values ="hotpink2" ) + ggplot2::labs(x ="Values of input",y ="Value of outcome" )```Shape of the logistic function::::::::::::###### Example with data points:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-logistic-function-data}: Example of logistic function with data points:::::::::::::{.my-r-code-container}::: {#lst-chap10-logistic-function-data} ```{r}#| label: logistic-function-data## generate fake data framex1 =c(10.2, 13.0, 14.1, 14.3, 14.5, 14.7, 15.0, 15.5, 16.1, 17.3, 19.0, 19.2, 19.8, 21.0, 26.5)y1 =rep(0, 15)x2 =c(17.8, 18.2, 19.0, 21.4, 21.5, 22.7, 24.0, 27.2, 31.0, 32.4, 33.8)y2 =rep(1, 11)tbl <- tibble::tibble(x =c(x1, x2),y =c(y1, y2)) |> dplyr::arrange(x)## draw logistic function with faked data pointsggplot2::ggplot( data = tbl, ggplot2::aes(x = x, y = y,color ="Logistic function") ) + ggplot2::stat_smooth(data = tbl,formula = y ~ x,method ="glm",se =FALSE,method.args =list(family = binomial) ) + ggplot2::geom_point( ggplot2::aes(alpha ="Observation"),color ="grey41" ) + ggplot2::labs(x ="Values of input",y ="Value of outcome" ) + ggplot2::scale_color_manual(name ="",values ="hotpink2" ) + ggplot2::scale_alpha_manual(name ="",values =0.5 )```Example of logistic function with data points:::***:::::::::###### Probability of outcome:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-logistic-function-data-annotated}: Example of logistic function showing probability of outcome for x = 20:::::::::::::{.my-r-code-container}::: {#lst-chap10-logistic-function-data-annotated} ```{r}#| label: logistic-function-data-annotated## generate fake data framex1 =c(10.2, 13.0, 14.1, 14.3, 14.5, 14.7, 15.0, 15.5, 16.1, 17.3, 19.0, 19.2, 19.8, 21.0, 26.5)y1 =rep(0, 15)x2 =c(17.8, 18.2, 19.0, 21.4, 21.5, 22.7, 24.0, 27.2, 31.0, 32.4, 33.8)y2 
=rep(1, 11)tbl <- tibble::tibble(x =c(x1, x2),y =c(y1, y2)) |> dplyr::arrange(x)## draw logistic function with faked data pointsggplot2::ggplot( data = tbl, ggplot2::aes(x = x, y = y,color ="Logistic function") ) + ggplot2::stat_smooth(data = tbl,formula = y ~ x,method ="glm",se =FALSE,method.args =list(family = binomial), ggplot2::aes(linetype ="predictor = 20 and\noutcome = .44 example") ) + ggplot2::geom_point( ggplot2::aes(alpha ="Observation"),color ="grey41" ) + ggplot2::geom_segment(x =20, xend =20,y =0, yend = .44,color ="#1f6fca",linetype ="dashed" ) + ggplot2::geom_segment(x =10, xend =20,y = .44, yend = .44,color ="#1f6fca",linetype ="dashed" ) + ggplot2::labs(x ="Values of input",y ="Value of outcome" ) + ggplot2::scale_color_manual(name ="",values ="hotpink2" ) + ggplot2::scale_alpha_manual(name ="",values =0.5 ) + ggplot2::scale_linetype_manual(name ="",values =c("dashed", "dashed") ) + ggplot2::annotate("text", x =10, y = .48, label ="0.44", color ="#1f6fca" ) + ggplot2::annotate("text", x =20.5, y =-.02, label ="20", color ="#1f6fca" )```Example of logistic function showing a probability of outcome for $x = 20$:::***Let's assume that the above graphic is a model for predicting library use from age. Then we can interpret it as a 44% probability of library use for a 20-year-old. Since 44% is lower than a 50% probability of the value of $y$, the model is predicting that the 20-year-old does not have the outcome. So, if the outcome is library use, the logistic model would predict this 20-year-old was not a library user. ::::::::::::::::::::::::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-odds}: Formula of odds related to probability:::::::::::::{.my-theorem-container}$$\begin{align*}odds = \frac{probability}{1-probability}\end{align*}$$ {#eq-chap10-odds}Substituting the logistic model from @eq-chap10-logistic-model:$$\begin{align*}odds &= \frac{\frac{1}{1+e^{-(\beta_{0}+\beta_{1}x)}}}{1- \frac{1}{1+e^{-(\beta_{0}+\beta_{1}x)}}} = e^{\beta_{0} + \beta_{1}x}\end{align*}$$ {#eq-chap10-odds-logistic-model}To be equivalent to the interpretation of the coefficients in linear regression, however, there is one more step. That is, what is the increase or decrease in the odds of the outcome with a one-unit increase in $x$?$$\begin{align*}OR = \frac{e^{b_0+b_{1}(x+1)}}{e^{b_0+b_{1}x}}\end{align*}$$ {#eq-chap10-odds-ratio}:::::::::@eq-chap10-odds-ratio shows that for every one-unit increase in the independent variable $x$, the odds of the outcome increase or decrease by $e^{b_1}$. Taking $e$ to the power of $b_1$ is referred to as exponentiating $b_1.$ After a model is estimated, the analyst will usually exponentiate the b value(s) in order to report odds ratios describing the relationships between each predictor and the outcome.## Achievement 3: Interpreting a simple logistic regression {#sec-chap10-achievement3}### NHST#### NHST Step 1Write the null and alternate hypotheses:::: {.callout-note}- **H0**: The model is no better than the baseline at predicting library use.- **HA**: The model is better than the baseline at predicting library use.:::#### NHST Step 2Compute the test statistic. The generalized linear model (`r glossary("GLM")`) or `stats::glm()` function can be used to estimate a binary logistic regression model. 
The model is generalized because it starts with the basic linear model and generalizes to other situations.The `stats::glm()` function will treat the first category as the reference group (the group *without* the outcome) and the second category as the group *with* the outcome. To see the order of `uses.lib`, use `base::levels()` to show the levels in order and use `stats:relevel()` to re-order levels.:::::{.my-example}:::{.my-example-header}:::::: {#exm-ID-text}: Numbered Example Title:::::::::::::{.my-example-container}::: {.panel-tabset}###### Reorder `uses.lib`:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-uses-lib-order}: Reverse the order of `uses.lib`:::::::::::::{.my-r-code-container}::: {#lst-chap10-uses-lib-order}```{r}#| label: uses-lib-order#| results: holdglue::glue("Order of uses.lib levels originally")base::levels(tbl10$uses.lib)tbl10.1<- tbl10 |> dplyr::mutate(uses.lib = forcats::fct_rev(uses.lib) )save_data_file("chap10", tbl10.1, "tbl10.1.rds")glue::glue("")glue::glue("---------------------------------------")glue::glue("Order of uses.lib levels reversed")base::levels(tbl10.1$uses.lib)```Reversing order of `uses.lib`::::::::::::In contrast to the book the order of my levels for the `uses.lib` data is reversed. It makes more sense to ask about if one uses the library, e.g., "yes" has to be the second category. To get the same order as in the book, I had to reverse the order. Instead of using `stats:relevel()` I reversed the order with `forcats::fct_rev()`.###### glm10.1:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-glm10.1}: Estimate library use model and print results:::::::::::::{.my-r-code-container}::: {#lst-chap10-glm10.1} ```{r}#| label: glm10.1glm10.1<-glm(formula = uses.lib ~ age, data = tbl10.1, family =binomial(link ="logit") ) save_data_file("chap10", glm10.1, "glm10.1.rds")base::summary(glm10.1)```Estimate the simple library use model and print (= summmarize) the results:::***@lst-chap10-glm10.1 uses `stats::family = binomial(link = "logit")`to specify the outcome variable. The "family" argument provides a description of the error distribution and link function to be used in the model. This is a convenient way to specify the details of the models and to distinguish between different types of generalized linear models that are appropriate for different kinds of outcome variables. In addition to the `glm()` help page you get additional information if you type [?family](https://rdrr.io/r/stats/family.html) into the console.:::::::::::: {.callout-warning #wrn-chap10-pkg-name-stats1}##### Problem with `stats::`-prefix in front of `glm()` in later R codeBecause of an error in the coding of the DescTools::PseudoR2() function I had to delete my `stats::`-prefix in front of the `glm()` function in @lst-chap10-glm10.1. 
(See @wrn-chap10-pkg-name-stats2):::###### Odds Ratio:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-compute-odds-ratio}: Get model fit, model significance, and odds ratios:::::::::::::{.my-r-code-container}::: {#lst-chap10-compute-odds-ratio}```{r}#| label: compute-odds-ratioodds.n.ends::odds.n.ends(glm10.1)```Model fit, model significance, and odds ratios::::::::::::::::::::::::***#### NHST Step 3Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).:::::{.my-example}:::{.my-example-header}:::::: {#exm-chap10-chi-squared-dist}: Chi-squared distribution with df = 1:::::::::::::{.my-example-container}::: {.panel-tabset}###### Theory: df = 1:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-chi-squared-graph}: Theoretical chi-squared distribution with df = 1 :::::::::::::{.my-r-code-container}::: {#lst-chap10-chi-squared-graph}```{r}#| label: chi-squared-graph-theoryggplot2::ggplot() + ggplot2::xlim(0, 12) + ggplot2::stat_function(fun = dchisq,args =list(df =1),linewidth =1,color ="hotpink2" ) + ggplot2::labs(x ="Chi-squared statistic",y ="Probability density" )```Theoretical chi-squared distribution with df = 1 :::Graphing the probability of the chi-squared distribution with 1 degree of freedom to show that the probability that the chi-squared would be 10.815 or higher --- if the null hypothesis were true --- is the very small area under the curve from 10.815 to the right.To get a better look at the position $x = 10.815$ let's zoom into the critical region. See @lst-chap10-chi-squared-graph-zoomed.:::::::::###### Theory: df = 1, zoomed:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-chi-squared-graph-zoomed}: Theoretical chi-squared distribution with df = 1, zoomed into the critical region of p = 10.815:::::::::::::{.my-r-code-container}::: {#lst-chap10-chi-squared-graph-zoomed}```{r}#| label: chi-squared-graph-zoomed## Define start of shadex_shade =10.815df =1y_shade = stats::dchisq(x_shade, df)## Define sequence of x-valuestib <- tibble::tibble(x =seq(9, 20, length.out =300)) |># Compute density values dplyr::mutate(y = stats::dchisq(x, df) )## Subset data for shaded areashaded_area <- tib |> dplyr::filter(x >= x_shade) |>## Necessary as starting point for y = 0! tibble::add_row(x = x_shade, y =0, .before =1)tib |>## Plot the Chi-square distribution: df = 1 ggplot2::ggplot(ggplot2::aes(x = x, y = y)) + ggplot2::geom_line(linewidth =1,color ="hotpink2" ) +## Draw segment ggplot2::geom_segment(x = x_shade,y =0,xend = x_shade,yend = y_shade ) +## Shade curve ggplot2::geom_polygon(data = shaded_area, fill ="lightblue", ggplot2::aes(x = x, y = y) ) + ggplot2::labs(x ="Chi-squared statistic",y ="Probability density" ) + ggplot2::annotate(geom ="text",x =11.5,y = .00058,label = base::round(stats::dchisq(x_shade, df), 6) )```Theoretical chi-squared distribution with df = 1, zoomed into the critical region :::Here we see that at the chi-square statistic of 10.815 the p-value is with .0005 very tiny and about 100 times much smaller as the level of statistical significance of .05. :::::::::When I developed above plots I wondered why in the book the graph is a rough line and the sample size (without NA's) of $n = 1501$ was mentioned. 
I learned that my curve was the theoretical curve whereas I need the randomized values to get a more rough simulation that is nearer the factual data.###### Randomized:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chi-squared-graph-randomized}: Randomized chi-squared distribution with df = 1, n = 1571:::::::::::::{.my-r-code-container}::: {#lst-chi-squared-graph-randomized} ```{r}#| label: chi-squared-graph-randomizedset.seed(42)n =1571df =1chisq_dist <- tibble::tibble(x = stats::rchisq(n, df))chisq_dist |> ggplot2::ggplot( ggplot2::aes(x) ) + ggplot2::geom_density(linewidth =1,color ="hotpink2" ) + ggplot2::labs(x ="Chi-squared statistic",y ="Probability density" )```Randomized chi-squared distribution with df = 1, n = 1571:::This is a randomized chi-squared distribution. I could not shade values $p > 10.815$ because there were only three rows with values greater than that. ::::::::::::::: {#tdo-chap10-empiricial-chisq-dist}:::::{.my-checklist}:::{.my-checklist-header}TODO: I wonder if I could provide an empirical chi-squared distribution with the actual data.::::::::How to plot an empirical chi-squared distribution with real data?::::::::::::::::::::{.my-resource}:::{.my-resource-header}:::::: {#lem-chap10-chisq-video}: Video about the chi-squared distribution:::::::::::::{.my-resource-container}I found a lengthy video tutorial (19:31) using RStudio featuring chi-squared distribution. You will see, `r glossary("pdf")`, `r glossary("cumulative distribution function", "cdf")`, `r glossary("quantile")`, plotting, examples, independence, `r glossary("Goodness-of-fit", "goodness of fit")` and explanations. There is also a accompanying web page with text and the code.:::::::::#### NHST Step 4Conclude and write report.::: {.callout #rep-chap10-chisq-glm10.1}The chi-squared test statistic for a logistic regression model with age predicting library use had a p-value of .001. This p-value indicates there is a .1% chance of a chi-squared statistic this large or larger if the null hypothesis were true. The null hypothesis is therefore rejected in favor of the alternate hypothesis that the model is better than the baseline at predicting library use. A logistic regression model including age was statistically significantly better than a null model at predicting library use [$χ^2(1) = 10.82; p = .001$].:::### Interpreting odds ratio significanceBoth, `base::summary()` and `odds.n.ends::odds.n.ends()` included values and significance statistics for the age predictor. The `odds.n.ends()` output included `r glossary("odds ratio", "odds ratios")`, which are easier to interpret. The outcome is transformed by the logistic function, therefore the coefficients from `summary()` are not easy to interpret directly.Looking at @lst-chap10-compute-odds-ratio we see that `age`has a odds ratio of .99. This could be interpreted as, “The odds of library use decrease by 1% for every 1-year increase in age.” The 1% comes from subtracting the odds ratio of .99 from 1 and multiplying by 100 to convert it to a percent. This is a stategy to make interpreting odds below 1 easier. Otherwise we would have to say: “The odds of library use are .99 times as high with every 1-year increase in age.”**When the confidence interval includes 1, the odds of the outcome are not statistically significantly different with a change in the independent variable.**The odds ratio for the age variable is less than 1, and the confidence interval does not include 1. 
Thus, the interpretation of this odds ratio would be as follows: ::: {.callout #rep-chap10-odds-ratio}##### Interpretation of age odds ratioThe odds of library use are 1% lower for every 1-year increase in age ($OR = .99; 95% CI: .986–.996$).:::#### NHST Step 1Write the null and alternate hypotheses:::: {.callout-note}- **H0**: Library use is not associated with age.- **HA**: Library use is associated with age.:::#### NHST Step 2Compute the test statistic. The `base::summary()` function following a `stats::glm()` function includes a `r glossary("z-score", "z-statistic")` comparing the coefficient estimate to zero. The z-score is a measure how many standard deviations away a value is from a mean. But for logistic regression, the coefficient divided by the `r glossary("standard error")` --- not by the `r glossary("standard deviation")` --- follows a z-distribution. This z-statistic is the test statistic for the `r glossary("Wald", "Wald test")`, which has the same purpose (but follows a different distribution) as the Wald test from @sec-chap09-achievement4.In the case of the `age` variable, dividing the estimate of $-0.008838$ by its standard error of $0.002697$ gives a z-statistic of $-3.276974$, which is well beyond the boundary of $–1.96$ for statistical significance when $\alpha$ is set at $.05$.The z-distribution is a normal distribution with a mean of 0 and a standard deviation of 1.:::::{.my-r-code}:::{.my-r-code-header}:::::: {#cnj-chap10-z-dist-odds-ratio}: z-distribution for sample size of 1571:::::::::::::{.my-r-code-container}::: {#lst-chap10-z-dist-odds-ratio}```{r}#| label: z-dist-odds-rationhstplot::plotztest(z =-3.28,tails ="one",xmax =5 )```z-distribution for sample size of 1571::::::::::::#### NHST Step 3Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).The area under the curve to the left of the test statistic in @lst-chap10-z-dist-odds-ratio is very small. Given this area is very small, the p-value of .00105 in the output @lst-chap10-glm10.1 from `base::summary(glm10.1)` makes sense. There is a .105% probability that this sample came from a population where there was no relationship between age and library use. Thus, there is a statistically significant relationship between age and library use (z = –3.28; p = .001).#### NHST Step 4Conclude and write report.The odds ratio for age is .99 with a 95% CI of .986–.996. The confidence interval shows the range where the odds ratio likely is in the population. Because the confidence interval does not include 1, this indicates that the odds ratio is statistically significantly different from 1. The interpretation would be as follows:::: {.callout #rep-odds-ratio-significance}##### Interpreting odds ratio significance The null hypothesis of no relationship between library use and age is rejected. The odds of library use are 1% lower for every 1-year increase in age in the sample (OR = .99; 95% CI: .986–.996). The 95% confidence interval indicates that the odds of library use are .4%–1.4% lower with each 1-year increase in age in the population that the sample came from.:::## Achievement 4: Interpreting two measures of model fit {#sec-chap10-achievement4}For linear regression, the $R^2$ statistic measured how well the model fit the observed data by measuring how much of the variability in the outcome was explained by the model. The concept of variance is appropriate for continuous but not categorical variables. 
There are several measures for categorical model fit, including the percent correctly predicted, sometimes called `r glossary("count-R-squared", "count $R^2$")` and the `r glossary("Adjusted-count-R2", "adjusted count $R^2$")`, which adjusts the count $R^2$ for the number of people in the largest of the two categories of the outcome.### Percent correctly predictedThe percent correctly predicted by the model is computed using the predicted probabilities, or fitted values, for each of the observations and comparing these probabilities to the true value of the outcome.:::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-percent-correctly-predicted}: Percent correctly predicted or count $R^2$:::::::::::::{.my-theorem-container}$$\begin{align*}R^2_{count} = \frac{n_{correct}}{n_{total}}\end{align*}$$ {#eq-chap10-percent-correctly-predicted}:::::::::For example, if a person in the data set were predicted to have a chance of more than 50% of library use, this would be transformed into a “yes” or “1” value of the outcome and then compared to the person’s actual library use. If the predicted value and the true value matched, this would be considered a correct prediction. The same vice-versa for predicted values less than 50%. The total number of people the model gets correct out of the total number of people in the data analyzed is the `r glossary("count-R-squared", "percent correctly predicted")` or count $R^2$.The @lst-chap10-compute-odds-ratio includes a table showing how many observations were correctly predicted in each category of the outcome.:::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-percent-correctly-predicted-example}: Calculate the percent correctly predicted with the `glm10.1` data:::::::::::::{.my-theorem-container}$$\begin{align*}R^2_{count} = \frac{338 + 500}{1571} = \frac{838}{1571} = 0.5334\end{align*}$$ {#eq-chapXY-formula}:::::::::### Adjusted count $R^2$An alternative measure of percent correctly predicted is the `r glossary("Adjusted-count-R2", "adjusted count R^2")` measure. It adjusts the count $R^2$ for the number of people in the largest of the two categories of the outcome.:::::{.my-theorem}:::{.my-theorem-header}:::::: {#thm-chap10-adjusted-count-r-squared}: Adjusted count $R^2$:::::::::::::{.my-theorem-container}$$\begin{align*}R^2_{count.adj} = \frac{n_{correct}- n_{most.common.outcome}}{n_{total}- n_{most.common.outcome}}\end{align*}$$ {#eq-chap10-adjusted-count-r-squared}:::::::::The argument behind this adjustment is that a null model, or a model with no predictors, could get a good percent correctly predicted just by predicting everyone was in the outcome category that had the bigger percentage of people in it.For the library use data, the most common category is library nonuse (or 0), with 798 of the 1571 participants with complete data for the model. Without knowing anything about library use, you could predict everyone in the data set was a nonuser and be right $\frac{788}{1571}$ or 50.8% of the time. Using the age predictor, the model is right $\frac{338+500}{1571}$ or 53% of the time. While this is not a huge increase, it did classify 40 additional people correctly compared to using the percentages in the outcome categories with no other information.The model using age to predict library use was correct 53% of the time (Count $R^2$ = .53). 
The adjusted count $R^2$ would be $\frac{338+500-798}{1571-798}$ or .05.

::: {.callout #rep-chap10-adjusted-r-squared}
##### Interpretation of adjusted $R^2$ in `glm10.1`

There were 5% more correct predictions by the age model than by the baseline (Adjusted Count $R^2$ = .05).
:::

### Pseudo-$R^2$

Pseudo-$R^2$ measures are another family of fit statistics that are reported relatively frequently. They generally try to quantify the reduction in lack of fit between a null and a full model. You can compute pseudo-$R^2$ values with several packages:

:::{.my-bulletbox}
:::: {.my-bulletbox-header}
::::: {.my-bulletbox-icon}
:::::
:::::: {#bul-chap10-pseudo-r2-functions}
::::::
: Packages with functions for computing pseudo-$R^2$
::::
:::: {.my-bulletbox-body}
- with `r2_mcfadden()` from the {**performance**} package (see: @sec-performance),
- with `PseudoR2()` from the {**DescTools**} package (see: @sec-DescTools),
- with `lrm()` (used instead of `stats::glm()`!) from the {**rms**} package (see: @sec-rms),
- with `pR2()` from the {**pscl**} package (see: @sec-pscl), and
- with `pseudo_r2()` from the {**pubh**} package (see: @sec-pubh).

Many of these packages also have functions for the percent correctly predicted and the adjusted count $R^2$ measure.
::::
:::

:::::{.my-resource}
:::{.my-resource-header}
:::::: {#lem-chap10-computing-pseudo-r-squared}
: Interpreting the different Pseudo-$R^2$
::::::
:::
::::{.my-resource-container}

There are many different pseudo-$R^2$ measures with different theoretical approaches. Therefore, the interpretations of these measures differ as well. The best resource I found to explain these differences is [FAQ: What are pseudo R-squareds?](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/) [@ucla2011]. But even this resource does not explain all of the different pseudo-$R^2$ measures. It is a complex issue on which I will have to spend more time.
::::
:::::

::: {.callout-caution #cau-chap10-pseudo-r2}
##### Interpreting pseudo-$R^2$

Even @lem-chap10-computing-pseudo-r-squared does not explain all of the different pseudo-$R^2$ measures. It is a complex issue on which I will have to spend more time studying the details.
:::

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap10-pseudo-r-squared}
: Computing Pseudo-$R^2$ with different packages
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### DescTools

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-pseudo-r-squared-desctools}
: Computing Pseudo-$R^2$
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-pseudo-r-squared-desctools}

```{r}
#| label: pseudo-r-squared-desctools
DescTools::PseudoR2(
  x = glm10.1,
  which = "all"
)
```

Computing Pseudo-$R^2$ with {**DescTools**}
:::
::::
:::::

::: {.callout-warning #wrn-chap10-pkg-name-stats2}
##### `DescTools::PseudoR2()` throws an error message

With `DescTools::PseudoR2()` I always got an error that the function `stats::glm()` was not known. After two hours of trial and error I learned that the function is not programmed for the case where the package name is given in front of the `glm()` call. After I deleted this prefix in @lst-chap10-glm10.1, the `DescTools::PseudoR2()` function worked.
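A minimal sketch of the fix (assuming `tbl10.1` holds the chapter data): refit the model with an unqualified `glm()` call so that `DescTools::PseudoR2()` can re-evaluate the stored call.

```r
# a sketch of the workaround: no `stats::` prefix on glm()
glm10.1 <- glm(formula = uses.lib ~ age,
               data = tbl10.1,
               family = binomial("logit"))
DescTools::PseudoR2(x = glm10.1, which = "all")
```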
(See also @wrn-chap10-pkg-name-stats1)
:::

###### performance

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-pseudo-r-squared-performance}
: Computing Pseudo-$R^2$ with {**performance**}
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-pseudo-r-squared-performance}

```{r}
#| label: pseudo-r-squared-performance
performance::r2(glm10.1)
performance::r2_tjur(glm10.1)
performance::r2_mcfadden(glm10.1)
performance::r2_coxsnell(glm10.1)
performance::r2_nagelkerke(glm10.1)
performance::r2_efron(glm10.1)
performance::r2_mckelvey(glm10.1)
```

Computing Pseudo-$R^2$ with {**performance**}
:::
::::
:::::

###### rms

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-pseudo-r-squared-rms}
: Computing Pseudo-$R^2$ with {**rms**}
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-pseudo-r-squared-rms}

```{r}
#| label: pseudo-r-squared-rms
rms::lrm(formula = uses.lib ~ age,
         data = tbl10.1)
```

Computing Pseudo-$R^2$ with {**rms**}
:::
::::
:::::

###### pscl

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-pseudo-r-squared-pscl}
: Computing Pseudo-$R^2$ with {**pscl**}
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-pseudo-r-squared-pscl}

```{r}
#| label: pseudo-r-squared-pscl
pscl::pR2(glm10.1)
```

Computing Pseudo-$R^2$ with {**pscl**}
:::
::::
:::::

###### pubh

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-pseudo-r-squared-pubh}
: Computing Pseudo-$R^2$ with {**pubh**}
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-pseudo-r-squared-pubh}

```{r}
#| label: pseudo-r-squared-pubh
pubh::pseudo_r2(glm10.1)
```

Computing Pseudo-$R^2$ with {**pubh**}
:::
::::
:::::

:::
::::
:::::

***

### Sensitivity and specificity

Sometimes it is useful to know whether the model is better at predicting people with the outcome or people without the outcome. The measures used for these two concepts are sensitivity and specificity. `r glossary("Sensitivity")` determines the percentage of the 1s or “yes” values the model got correct, while `r glossary("specificity")` computes the percentage of 0s or “no” values the model got correct.

In @lst-chap10-compute-odds-ratio, the sensitivity is 43.7% while the specificity is 62.7%. The model was better at predicting the "no" values than the "yes" values. These percentages could also be computed from the frequency table in the output: The model predicted 500 of the 798 people in the 0 category correctly (62.7%) and 338 of the 773 in the 1 category correctly (43.7%).

## Achievement 5: Estimating a larger logistic regression model {#sec-chap10-achievement5}

### Compute full logistic regression model

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap10-compute-full-glm}
: Compute full logistic regression model
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### glm10.2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-glm10.2}
: Compute full logistic regression model, first try (`glm10.2`)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-glm10.2}

```{r}
#| label: compute-glm10.2
# estimate the full library use model and print results
glm10.2 <- glm(formula = uses.lib ~ age + sex + educ + parent +
                 disabled + rurality + raceth + ses,
               data = tbl10.1,
               na.action = na.exclude,
               family = binomial("logit"))

odds.n.ends::odds.n.ends(glm10.2)
```

Full logistic regression model for library use data (`glm10.2`)
:::

My values are quite different from the outcome in the book. The reason is that I have the levels ordered differently.
I will order my levels so that I get the same result as in the book.
::::
:::::

###### order levels

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-order-levels}
: Order levels to match the result from the book
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-order-levels}

```{r}
#| label: order-levels
tbl10.3 <- tbl10.1 |>
  dplyr::mutate(educ = forcats::fct_relevel(educ,
      "< HS", "Four-year degree or more", "HS to 2-year degree")) |>
  dplyr::mutate(parent = forcats::fct_rev(parent)) |>
  dplyr::mutate(rurality = forcats::fct_relevel(rurality,
      "rural", "suburban", "urban")) |>
  dplyr::mutate(raceth = forcats::fct_relevel(raceth,
      "Hispanic", "Non-Hispanic Black", "Non-Hispanic White")) |>
  dplyr::mutate(ses = forcats::fct_relevel(ses,
      "high", "low", "medium"))

save_data_file("chap10", tbl10.3, "tbl10.3.rds")
```

Levels ordered to match the result from the book
:::

::: {.callout-caution #cau-chap10-level-order}
##### Why are levels such as `ses` not ordered according to their intrinsic ordering?

I have ordered the levels to match the book outcome. But I do not understand why some of the levels (`educ`, `rurality` or `ses`) are not ordered from high to low or vice versa. In the book's ordering, the highest or lowest level always ends up in the middle.
:::
::::
:::::

###### glm10.3

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-glm10.3}
: Compute full logistic regression model, final version (`glm10.3`)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-glm10.3}

```{r}
#| label: compute-glm10.3
# estimate the full library use model
# matching the level order from the book
# and print results with odds.n.ends()
glm10.3 <- glm(formula = uses.lib ~ age + sex + educ + parent +
                 disabled + rurality + raceth + ses,
               data = tbl10.3,
               na.action = na.exclude,
               family = binomial("logit"))

save_data_file("chap10", glm10.3, "glm10.3.rds")

odds.n.ends::odds.n.ends(glm10.3)
```

Compute full logistic regression model, final version (`glm10.3`)
:::

::: {.callout-caution #cau-chap10-na-action}
##### Using `na.action = na.omit` instead of `na.action = na.exclude`

To find influential values I tried several different packages. Using {**glmtoolbox**} I received an error message when I used `na.action = na.exclude`. In the help file for the `influence()` function I learned:

> Cases omitted in the fit are omitted unless a `na.action` method was used (such as `na.exclude`) which restores them. … If a model has been fitted with `na.action = na.exclude` (see `na.exclude`), cases excluded in the fit are considered here.

This is weird behavior, because I thought that `na.exclude` would exclude and not restore the NA values. But when I therefore tried to change to `na.action = na.omit`, I got an error message in @lst-chap10-check-residuals: "`res_std` must be size 1601 or 1, not 1427."

Finally I decided to stick with `na.action = na.exclude` (or to delete this line completely, which has the same effect) and not to use the {**glmtoolbox**} package.
:::
::::
:::::

:::
::::
:::::

### NHST Step 1

Write the null and alternate hypotheses:

::: {.callout-note}
- **H0**: A model containing age, sex, education, parent status, disability status, rurality, SES, and race-ethnicity is no better than the baseline at explaining library use.
- **HA**: A model containing age, sex, education, parent status, disability status, rurality, SES, and race-ethnicity is better than the baseline at explaining library use.
:::

### NHST Step 2

Compute the test statistic.
The chi-squared test statistic of 94.74 was computed by the `odds.n.ends()` function.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-chi-squared-glm10.2}
: Visualizing a chi-squared distribution with n = 1427, df = 12 and a test statistic of 94.74
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-chi-squared-glm10.2}

```{r}
#| label: chi-squared-glm10.2
nhstplot::plotchisqtest(chisq = 94.74,
                        df = 12)
```

Chi-squared distribution with n = 1427, df = 12 and a test statistic of 94.74
:::
::::
:::::

### NHST Step 3

Review and interpret the test statistic: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The probability that the chi-squared would be 94.74 if the full model were no better than the null model is shown as the area under the curve to the right starting at 94.74. This corresponds to a very small p-value < .001.

### NHST Step 4

Conclude and write report.

::: {.callout #rep-chap10-glm10.2}
A logistic regression model containing age, sex, education, parental status, disability status, SES, race-ethnicity, and rurality was statistically significantly better than the baseline probability at predicting library use [$χ^2(12) = 94.74; p < .001$].
:::

## Achievement 6: Interpreting the results of a larger logistic regression model {#sec-chap10-achievement6}

### Interpreting odds ratios

@lst-chap10-compute-glm10.3 includes the odds ratios, which are easier to interpret than the significance statistics for the predictors. When the `r glossary("confidence interval")` includes 1, the `r glossary("odds ratio")` could be 1, which indicates it is not statistically significantly different from 1. If the odds ratio is greater than 1 *and* the confidence interval does not include 1, the odds ratio suggests a statistically significant increase in the odds of the outcome.

Factor variables with two categories are interpreted with respect to the reference group, or the group *not* shown in the output. Since a factor variable can only change by going from one group to the other, instead of the odds ratio being for a one-unit change in the predictor, it is **the change in the odds of the outcome when moving from the reference group to the other group**.

For the sex variable, `male` is the group shown in the output, so the odds ratio for `sex` is interpreted as follows: “Males have 51% lower odds of library use compared to females.” The 51% comes from subtracting the odds ratio of .489 from 1 (1 – .49 = .51) and multiplying by 100.

**For factor variables with more than two categories, each odds ratio is interpreted with respect to the reference group for that variable**. For `educ`, the group not shown is the `< HS` group, indicating that it is the reference group. The odds ratio for the Four-year degree or more group is 1.90, so the interpretation would be that individuals with a four-year degree or more have 1.90 times higher odds of library use compared to people with less than a high school education.

A list of the odds ratios where the confidence interval did not include 1:

- age (OR = .99; 95% CI: .984–.996)
- sexmale (OR = .49; 95% CI: .39–.61)
- educFour-year degree or more (OR = 1.90; 95% CI: 1.26–2.90)
- racethNon-Hispanic Black (OR = 1.55; 95% CI: 1.002–2.417)

#### Interpreting significant odds ratios greater than 1

There were two significant predictors with odds ratios greater than 1.
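(The odds ratios and confidence intervals quoted in this section can also be pulled straight from the fitted model object. A minimal sketch, assuming `glm10.3` from @lst-chap10-compute-glm10.3; note that `confint()` profiles the likelihood, so the limits may differ slightly from those of other packages.)

```r
# a sketch: odds ratios with 95% CIs via exp() of the coefficients
or_tab <- exp(cbind(OR = coef(glm10.3), confint(glm10.3)))
round(or_tab, 3)
```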
::: {.callout #rep-chap10-interpret-educ-raceth}
##### Interpreting `educ` and `raceth` as significant odds ratios greater than 1

- People with a four-year degree or more education had 1.90 times higher odds of library use compared to people with less than a high school education (OR = 1.90; 95% CI: 1.26–2.90).
- People who were non-Hispanic Black had 1.55 times higher odds of library use compared to people who were Hispanic (OR = 1.55; 95% CI: 1.002–2.417).
:::

#### Interpreting non-significant odds ratios greater than 1

When the confidence interval includes 1, it is possible that the true value of the odds ratio is 1. Therefore the values would be reported without the interpretation of higher odds. For instance:

::: {.callout #rep-chap10-interpret-rurality}
##### Interpreting `rurality` as a non-significant odds ratio greater than 1

The odds of library use were not statistically significantly different for urban residents compared to rural residents (OR = 1.23; 95% CI: .93–1.63).
:::

#### Interpreting significant odds ratios less than 1

The age variable and male sex both show significant odds ratios lower than 1.

::: {.callout #rep-chap10-interpret-age-sex}
##### Interpreting `age` and `sex` as significant odds ratios less than 1

- For age, the odds of library use are 1% lower for every 1-year increase in a person’s age (OR = .99; 95% CI: .984–.996).
- For male sex, the reference group is female and the odds ratio is .49. Subtracting 1 – .49 results in .51, so the odds of library use are 51% lower for males compared to females (OR = .49; 95% CI: .39–.61).
:::

#### Interpreting non-significant odds ratios less than 1

The low-SES category had a non-significant odds ratio of .93 (95% CI: .57–1.52). The interpretation would be as follows:

::: {.callout #rep-chap10-interpret-ses}
##### Interpreting `ses` as a non-significant odds ratio less than 1

The odds of library use are not statistically significantly different for those with low SES compared to those in the reference group of high SES (OR = .93; 95% CI: .57–1.52).
:::

### Compute and interpret model fit

#### Count $R^2$

The model correctly predicted 378 of the 696 who use the library and 482 of the 731 who do not use the library. Overall, it was correct for $\frac{378+482}{1427} = \frac{860}{1427}$ of the observations, or 60.3% of the time (count $R^2 = .603$). It was better at classifying those who do not use the library (specificity = 65.9%) than those who use the library (sensitivity = 54.3%).

#### Adjusted count $R^2$

$$R^2_{adj.count} = \frac{860-731}{1427-731} = .185$$

The model predicted 18.5% more of the observations correctly than a baseline model would have.

## Achievement 7: Checking logistic regression assumptions {#sec-chap10-achievement7}

:::{.my-bulletbox}
:::: {.my-bulletbox-header}
::::: {.my-bulletbox-icon}
:::::
:::::: {#bul-chap10-logistic-regression-assumption}
::::::
: Logistic regression assumptions
::::
:::: {.my-bulletbox-body}
There are three assumptions for logistic regression:

- independence of observations,
- linearity, and
- no perfect multicollinearity.
::::
:::

### Logistic regression assumptions

#### Independence of observations

The Pew Research Center conducted a phone survey where they selected a single person in a randomly selected household. **The assumption is met**.

#### Linearity

For logistic regression, the outcome variable is binary, so its relationship with another variable will never be linear.
Instead of plotting the relationship of the `r glossary("outcome")` with each continuous predictor, `r glossary("linearity")` is tested by plotting the log-odds of the predicted probabilities for the outcome against each of the continuous `r glossary("Predictor Variable", "predictors")` in the model. For example, are the predicted values equally accurate for people of a younger age compared to people of an older age?

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-check-linearity-logistic-regression}
: Checking linearity of the age variable for the model of library use `glm10.3`
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-check-linearity-logistic-regression}

```{r}
#| label: check-linearity-logistic-regression
# make a variable of the log-odds of the predicted values
logit_use <- log(x = glm10.3$fitted.values / (1 - glm10.3$fitted.values))

# make a small data frame with the log-odds variable and the age predictor
tbl10_lin <- tibble::tibble(logit_use, age = glm10.3$model$age)

# create a plot (Figure 10.9)
tbl10_lin |>
  ggplot2::ggplot(ggplot2::aes(x = age, y = logit_use)) +
  ggplot2::geom_point(ggplot2::aes(size = "Observation"),
                      color = "gray60",
                      alpha = .6) +
  ggplot2::geom_smooth(formula = y ~ s(x, bs = "cs"),
                       method = "gam",
                       se = FALSE,
                       ggplot2::aes(color = "Loess curve")) +
  ggplot2::geom_smooth(formula = y ~ x,
                       method = lm,
                       se = FALSE,
                       ggplot2::aes(color = "linear")) +
  ggplot2::labs(x = "Age in years",
                y = "Log-odds of library use predicted probability") +
  ggplot2::scale_color_manual(name = "Type of fit line",
                              values = c("dodgerblue2", "deeppink")) +
  ggplot2::scale_size_manual(values = 1.5, name = "")
```

Checking linearity of the age variable for the model of library use `glm10.3`
:::

The graph shows the Loess curve close to the linear fit line, with the exception of the youngest ages.
::::
:::::

It is up to the analyst to determine whether the actual relationship is close enough to linear to meet this assumption. If the assumption is not met, the variable might be removed from analysis, transformed, or recoded.

1. `r glossary("B-spline", "spline regression")` is one way to deal with problems of linearity in linear and logistic regression.
2. It might be worth considering that the data frame includes people as young as 16 years old; it is possible that there are different predictors of library use before adulthood, and restricting the age range of the data frame to adults over 18 could be another option for addressing this deviation from linearity in the youngest survey participants.

::: {.callout-note #nte-chap10-rectify-assumptions-not-met}
##### Why older than 18 and not older than 25?

It seems to me that there is still a problem with people 18 years old. The Loess curve in @lst-chap10-check-linearity-logistic-regression only straightens out for people older than about 25 years.

Generally, I have a bad feeling about the sentence: "It is up to the analyst to determine whether the actual relationship is close enough to linear to meet this assumption." For me there seems to be too much subjectivity in this approach. As I already remarked in @tdo-chap09-linearity-test, I have to look for tools, such as statistical tests, to reach a more objective decision about whether the assumption of linearity is met.
:::

#### No perfect multicollinearity

The `r glossary("GVIF")` is similar to the `r glossary("VIF")` used for linear regression, but modified to account for the categorical outcome. It examines how well each predictor variable in the model is explained by the group of other predictor variables.
If a predictor is well explained by the others, it is redundant and unnecessary. For the GVIF, a threshold of $GVIF^{\frac{1}{2 \times df}} < 2.5$ is often used as a cutoff, with values of 2.5 or higher indicating a failed multicollinearity assumption.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-gvif}
: Compute GVIF
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-gvif}

```{r}
#| label: compute-gvif
car::vif(mod = glm10.3)
```

Compute and show GVIF values
:::

None of the values in the right-hand column are 2.5 or higher, so there is no discernible problem with multicollinearity.
::::
:::::

#### Conclusion

Overall, the assumption checking revealed a possible problem with age as a predictor, mostly at the youngest ages. The other assumptions were met. It might be useful to restrict the age variable to adults or to transform the age variable.

### Model diagnostics

#### Finding outliers with standardized residuals

Referring to @lst-chap09-diagnostics-standardized-residuals: residuals are the distances between the predicted value of the outcome and the true value of the outcome for each person or observation in the data set. These values are standardized by computing z-scores for each one so that they follow a z-distribution. **Standardized `r glossary("residually", "residuals")` are `r glossary("z-score", "z-scores")` for the residual values.** Z-scores that are greater than 1.96 or less than –1.96 are about two standard deviations or more away from the mean of a measure. In this case, they are more than two standard deviations away from the mean residual value. Very large values of standardized residuals can indicate that the predicted value for an observation is far from the true value for that observation, indicating that an examination of that observation could be useful.

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap10-standardize-residuals}
: Standardize residuals and check if z-scores are greater than 2
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### Computing

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-check-residuals}
: Compute and check residuals
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-check-residuals}

```{r}
#| label: check-residuals
# get standardized residuals and add to data frame
tbl10.4 <- tbl10.3 |>
  dplyr::mutate(res_std = stats::rstandard(model = glm10.3)) |>
  tibble::rowid_to_column("ID")

# check the residuals for large values > 2
tbl10.4 |>
  tidyr::drop_na(res_std) |>
  dplyr::summarize(max_resid = max(abs(x = res_std)))

save_data_file("chap10", tbl10.4, "tbl10.4.rds")
```

Compute and check residuals
:::

The maximum absolute value of any standardized residual was less than 1.96, so the standardized residuals did not reveal any outliers.
::::
:::::

###### Plotting

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-graph-residuals}
: Standardized residuals for library use model
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-graph-residuals}

```{r}
#| label: graph-residuals
tbl10.4 |>
  # tidyr::drop_na(res_std) |>
  ggplot2::ggplot(ggplot2::aes(x = ID,
                               y = res_std,
                               group = "Library use")) +
  ggplot2::geom_jitter(ggplot2::aes(color = uses.lib)) +
  ggplot2::labs(x = "Observation number",
                y = "Standardized residuals") +
  ggokabeito::scale_okabe_ito(name = "Library use",
                              aesthetics = "color",
                              order = c(1, 3),
                              alpha = .6)
```

Standardized residuals for library use model
:::

This is my reproduction of the book’s Figure 10.10, using the colorblind-friendly Okabe-Ito scale.

Most predicted probabilities were about one standard deviation above or below the mean predicted probability.
So there were no outliers to examine.
::::
:::::

:::
::::
:::::

### dfbeta(s)

As I already mentioned in @wrn-chap09-diagnostics-df-beta, the book mixes `r glossary("dfbeta")` & `r glossary("dfbetas")` together. But these two are different measures:

- `dfbetas` (plural) are the standardized values, with a cutoff of $2 / \sqrt{n}$. Because of the standardization, `dfbetas` are the recommended measure to use.
- `dfbeta` (singular) values depend on the scale of the data.

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap10-compute-influence-measures}
: Computing dfbeta(s)
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### all together

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-influence-measures}
: Computing different influence measures
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-influence-measures}

```{r}
#| label: compute-influence-measures
#| results: hold
## get influence statistics
infl10.3 <- stats::influence.measures(model = glm10.3)

## print influential data
glue::glue("The next two results are from 'example(influence.measures)'")
glue::glue("")
glue::glue("Print list of influential data")
which(apply(infl10.3$is.inf, 1, any))

## which observations 'are' influential
glue::glue("")
glue::glue("Which observations 'are' influential?")
summary(infl10.3)

## print all diagnostic data
glue::glue("")
glue::glue("Print a summary of all diagnostic data")
base::summary(infl10.3$infmat)
```

Computing different influence measures
:::

Generally, it turned out that there are no influential values! Otherwise I would have gotten an output where the affected results would have been marked with an asterisk.

I had a hard time understanding the output, because the names of the variables do not match the names in the tibble.
But after some trial and error I understood that I have to look for the variable name without vowels, plus a combination of letters for the levels:

- dfb.1_: intercept
- dfb.age: age
- dfb.sxml: sex-male
- dfb.prnt: parent-parent
- dfb.dsbl: disabled-yes
- dfb.rrltys: rurality-suburban
- dfb.rrltyr: rurality-rural
- dfb.sslw: ses-low
- dfb.ssmd: ses-medium
- dfb.edom: educ-Four-year degree or more
- dfb.et2d: educ-HS to 2-year degree
- dfb.rN-B: raceth-Non-Hispanic Black
- dfb.rN-W: raceth-Non-Hispanic White

This list was confirmed after I computed `dfbetas` directly in @lst-chap10-compute-dfbetas, because there I got better variable names for the results.
::::
:::::

I added to @lst-chap10-compute-influence-measures two measures provided by an additional example from the last line of the `stats::lm.influence()` help page ("For more "user level" examples, use example(influence.measures)"), or respectively the examples in the `stats::influence.measures()` help page.

###### dfbetas

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-dfbetas}
: Compute `dfbetas`
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-dfbetas}

```{r}
#| label: compute-dfbetas
base::summary(dfbetas(model = glm10.3))
```

Compute `dfbetas`
:::

Besides the better variable names, it also turned out that in @lst-chap10-compute-influence-measures the results are dfbetas and not dfbeta, which I computed in @lst-chap10-compute-dfbeta to see the differences.
::::
:::::

###### dfbeta

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-dfbeta}
: Compute `dfbeta`
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-dfbeta}

```{r}
#| label: compute-dfbeta
base::summary(dfbeta(model = glm10.3))
```

Compute `dfbeta`
:::

These `dfbeta` values are quite different from the `dfbetas`.
::::
:::::

:::
::::
:::::

::: {.callout-warning #wrn-chap10-dfbetas-glm}
##### dfbetas suitable for Generalized Linear Models (GLMs)?

I am not sure if the dfbeta values are --- besides for linear models computed with `stats::lm()` --- also valid for *generalized* linear models computed with `stats::glm()`. My doubts are nourished by the fact that in the help file all computed `dfbeta` and `dfbetas` require a method of class `lm`. This is in contrast to `r glossary("Cook’s D")`, where a method for both classes `lm` and `glm` exists.

See also the explanation of the argument `infl`:

> influence structure as returned by `lm.influence` or `influence` (the latter only for the glm method of rstudent and cooks.distance).
:::

### Cook’s Distance

Cook’s Distance is computed in a similar way to dfbeta(s) --- each observation is removed and the model is re-estimated without it. `r glossary("Cook’s D")` then combines the differences between the models with and without an observation for all the parameters together, instead of one at a time like the dfbeta(s). A high Cook’s D would indicate that removing the observation made a big difference, and therefore the observation might be considered influential.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-cooks-d}
: Compute Cook’s D
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-cooks-d}

```{r}
#| label: compute-cooks-d
tibble::tibble(infl_cooks_d = cooks.distance(glm10.3)) |>
  tidyr::drop_na() |>
  dplyr::filter(infl_cooks_d > 4 / dplyr::n())
```

Compute Cook’s D
:::

Depending on whether I compute it with or without the NAs, I get 12 or six influential values. This is in contrast to @lst-chap10-compute-influence-measures, where I couldn't detect any influential values.

But I cannot see to which rows in the data these six values belong; a sketch addressing this follows.
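A minimal sketch, assuming `glm10.3` and `tbl10.4` (with its `ID` column from @lst-chap10-check-residuals): attach Cook's D to the data frame so the flagged rows can be looked up directly.

```r
# a sketch: keep row IDs next to Cook's D so influential rows
# can be inspected in the original data
tbl10.4 |>
  dplyr::mutate(cooks_d = stats::cooks.distance(glm10.3)) |>
  tidyr::drop_na(cooks_d) |>
  dplyr::filter(cooks_d > 4 / dplyr::n()) |>
  dplyr::select(ID, uses.lib, age, cooks_d)
```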
Knowing the rows would be necessary in order to inspect and/or delete these values.
::::
:::::

### Compute leverage

Leverage is the influence that the observed value of the outcome has on the predicted value of the outcome. Leverage values range between 0 and 1. To determine which leverage values indicate influential observations, a cutoff of $\frac{2p}{n}$ is often used. In this case, we have 13 parameters, so the cutoff is $\frac{2 \times 13}{1427} = 0.01822004$.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-compute-hat-values}
: Compute leverage (hat values)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-compute-hat-values}

```{r}
#| label: compute-hat-values
infl_hat <- tibble::tibble(data.frame(infl10.3$infmat)) |>
  dplyr::select(hat) |>
  tidyr::drop_na()

infl_hat |>
  dplyr::filter(hat > 2 * 13 / dplyr::n())
```

Compute leverage (hat values)
:::

Instead of 35 values I got 95 influential values. The reason for the difference is that there are 1601 records, and not 1427 as calculated in the book.

But all in all I still think that both results, 35 and 95, are wrong, because the variable `infl10.3$is.inf` does not have a single `TRUE` value for any measure, as already shown in @lst-chap10-compute-influence-measures.
::::
:::::

:::: {#tdo-chap10-influential-values}
:::::{.my-checklist}
:::{.my-checklist-header}
TODO: Look for other tutorials how to compute influential values
:::
::::{.my-checklist-container}
I think that there are several reasons why the computation of the influential values in the book has some inaccuracies:

- There is a misunderstanding between `dfbeta` and `dfbetas`. In the book both measures are treated the same, but there is a difference. (BTW: Both measures are written without a dash, in contrast to the book.)
- In my opinion it is not established that all of the `lm` measures work the same with `glm`. In the help file there are several hints that differences exist.
- The output of `stats::influence.measures()`, as explained in the examples of the help files, does not show any `TRUE`s for the diagnosis variable `is.inf`.
- And last but not least: there is no clear advice on what I should do when I detect influential values. The book says that one should inspect the values and, if there was a coding error, correct or remove the data. But if there was no coding error, should I stick with the influential values? Even if I notice a pattern in the influential values, such as we have seen with distances in Texas in @sec-chap09?

There are many tutorials explaining how to work with outliers and influential values. But these texts --- often buried in textbooks --- are complex. I need to dedicate some time to read them and to try out their data examples. But given the big importance of this subject, I will explore some of these tutorials.
::::
:::::

Look for other tutorials how to compute influential values
::::

:::::{.my-remark}
:::{.my-remark-header}
:::::: {#rem-chap10-outlier-vs-influential}
: Difference between outliers and influential points
::::::
:::
::::{.my-remark-container}

There was no good explanation in the book about the difference between `r glossary("outliers")` and `r glossary("influential observation", "influential values")`. (See the [clarification in section 12.6](https://openstax.org/books/introductory-statistics/pages/12-6-outliers) of Introductory Statistics, [@illowsky2011, 541]).

> **Outliers are observed data points that are far from the least squares line**.
> They have large "errors", where the "error" or residual is the vertical distance from the line to the point.

> Besides outliers, a sample may contain one or a few points that are called **influential points**. Influential points are observed data points that are far from the other observed data points in the horizontal direction. These points may have a big effect on the slope of the regression line. To begin to identify an influential point, you can remove it from the data set and see if the slope of the regression line is changed significantly.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-anscombe-datasets}
: Anscombe's Quartet of 'Identical' Simple Linear Regressions
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-anscombe-datasets}

```{r}
#| label: anscombe-datasets
#| results: hold
## Now, do what you should have done in the first place: PLOTS
ff <- y ~ x
mods <- setNames(as.list(1:4), paste0("lm", 1:4))
for (i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  mods[[i]] <- lmi <- lm(ff, data = anscombe)
}

op <- par(mfrow = c(2, 2),
          mar = 0.1 + c(4, 4, 1, 1),
          oma = c(0, 0, 2, 0))
for (i in 1:4) {
  ff[2:3] <- lapply(paste0(c("y", "x"), i), as.name)
  plot(ff,
       data = anscombe,
       col = "red",
       pch = 21,
       bg = "orange",
       cex = 1.2,
       xlim = c(3, 19),
       ylim = c(3, 13))
  abline(mods[[i]], col = "blue")
}
mtext("Anscombe's 4 Regression data sets",
      outer = TRUE,
      cex = 1.5)
```

Anscombe's 4 Regression data sets
:::

@lst-chap10-anscombe-datasets shows the difference between an outlier and an influential point very clearly:

- The point at the far right top of panel x4 lies exactly on the regression line, so by the vertical-distance definition it is not an outlier; but it is highly influential, because it sits far from the other points in the horizontal direction and determines the slope of the line.
- The point at the right top of panel x3 is an outlier and at the same time an influential point, because it changes the regression line.
::::
:::::

## Achievement 8: Using the model for predictions {#sec-chap10-achievement8}

Logistic regression models are useful not only for examining relationships between predictors and binary outcomes, but also for predicting probabilities for hypothetical or new cases that are not in the data frame.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-model-prediction}
: Using the model to predict probabilities for observations that are outside the data set
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-model-prediction}

```{r}
#| label: model-prediction
# make a new data frame containing the observations of interest
glm10.3_example <- tibble::tibble(
  age = c(35, 65, 68),
  sex = c("male", "female", "male"),
  educ = c("Four-year degree or more",
           "Four-year degree or more",
           "Four-year degree or more"),
  disabled = c("no", "no", "no"),
  parent = c("not parent", "parent", "parent"),
  rurality = c("rural", "rural", "rural"),
  raceth = c("Non-Hispanic White",
             "Non-Hispanic White",
             "Non-Hispanic White"),
  ses = c("low", "medium", "medium"))

# use the new data frame to predict
predictions <- predict(object = glm10.3,
                       newdata = glm10.3_example,
                       type = "response")
predictions
```

Predict probabilities for observations that are outside the data set
:::
::::
:::::

## Achievement 9: Adding and interpreting interaction terms {#sec-chap10-achievement9}

An interaction term examines how two (or more) variables might work together to influence an outcome. A possible interaction between sex and parental status can be included in the logistic regression model by adding a term, $+ sex \times parent$, to the formula. When an interaction is included in a model, it is customary to also include the interacting variables separately.
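In R's formula syntax this convention is automatic: the `*` operator expands to the main effects plus their interaction, so the two notations below specify the identical model. A minimal illustration:

```r
# `sex * parent` is shorthand for both main effects plus the interaction
f1 <- uses.lib ~ age + sex * parent
f2 <- uses.lib ~ age + sex + parent + sex:parent

# both formulas expand to the same model terms
all.equal(attr(terms(f1), "term.labels"),
          attr(terms(f2), "term.labels"))
```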
In a model with interaction terms, the terms that are not part of the interaction are called *main effects*.

### Computing interaction

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-interaction-parent-sex}
: Computing interaction parental status and sex
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-interaction-parent-sex}

```{r}
#| label: interaction-parent-sex
# estimate the library use model with the interaction term and print results
glm10.4 <- glm(formula = uses.lib ~ age + sex + educ + parent +
                 disabled + rurality + ses + raceth + sex * parent,
               data = tbl10.4,
               family = binomial("logit"))

odds.n.ends::odds.n.ends(mod = glm10.4)
```

Generalized linear model `glm10.4` with interaction parental status and sex
:::
::::
:::::

### NHST Step 1

Write the null and alternate hypotheses:

::: {.callout-note}
- **H0**: A model containing age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is not useful in explaining library use.
- **HA**: A model containing age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is useful in explaining library use.
:::

### NHST Step 2

Compute the test statistic. The test statistic is $χ^2 = 96.30$ with 13 degrees of freedom.

### NHST Step 3

Review and interpret the test statistic: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The p-value is less than .001.

### NHST Step 4

Conclude and write report.

The null hypothesis is rejected. The model including age, sex, education, parent status, disability status, rurality, SES, race-ethnicity, and an interaction between sex and parent status is useful in explaining library use [$χ^2(13) = 96.30; p < .001$].

::: {.callout #rep-chap10-interpret-glm10.4}
##### Interpretation of `glm10.4` for library use with interaction parent status and sex

Age, sex, having a four-year degree or more, and non-Hispanic Black race-ethnicity were statistically significant predictors of library use. For every 1-year increase in age, the odds of library use decreased by 1% (OR = .99; 95% CI: .983–.996). Males have 55% lower odds of library use compared to females (OR = .45; 95% CI: .35–.58), and those with a four-year degree have 1.93 times higher odds of library use compared to those with less than a high school education (OR = 1.93; 95% CI: 1.28–2.94). Finally, non-Hispanic Black participants had 1.56 times higher odds of library use compared to Hispanic participants (OR = 1.56; 95% CI: 1.01–2.43). Parental status, SES, disability status, and rurality were not significantly associated with library use, and those with a high school to 2-year degree had neither higher nor lower odds of library use than those with less education. Likewise, non-Hispanic White participants had neither higher nor lower odds of library use compared to Hispanic participants. There was no statistically significant interaction between sex and parent status on the odds of library use.

The model correctly predicted 383 of 696 library users and 467 of 731 library nonusers.
Overall, the model correctly predicted 850 of 1427 observations (59.6%); the model was more specific (63.9%) than sensitive (55.0%), indicating that it was better at classifying library nonusers than library users.
:::

### Check assumptions

There is nothing new in this section.

## Achievement 10: Likelihood ratio test {#sec-chap10-achievement10}

Given that the interaction term was not statistically significant and the model violated the linearity assumption, it seems preferable to report the previous model without the interaction term as the final model. However, there is a statistical test that can be used to determine whether a larger model is statistically significantly better than a smaller model. The test is called the `r glossary("likelihood-ratio", "likelihood ratio test")`.

The likelihood ratio (LR) test compares two nested models, where one model includes a subset of the variables in the other model. For example, the simple logistic regression model with age as the only predictor could be compared statistically to any of the larger models because they all have the age variable in them. The small model is nested in each of the larger models. In addition, the larger model without the interaction could be compared to the model with the interaction term. Models where the variables of one are completely different from the variables of the other cannot be compared with this test.

The idea behind the LR test is to determine if the additional variables improve the model enough to warrant the complexity of adding more variables to the model.

### NHST Step 1

Write the null and alternate hypotheses:

::: {.callout-note}
- **H0**: The larger model with the interaction term is no better at explaining library use compared to the model without the interaction term.
- **HA**: The larger model with the interaction term is better at explaining library use compared to the model without the interaction term.
:::

### NHST Step 2

Compute the test statistic. The {**lmtest**} package (@sec-lmtest) has the `lrtest()` function, which can be used to compare two nested models. The LR test computes the difference between the log-likelihoods for the two models and multiplies this by 2; the result has a chi-squared distribution.

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-lr-test}
: Likelihood ratio test
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-lr-test}

```{r}
#| label: lr-test
lmtest::lrtest(object = glm10.3, glm10.4)
```

Likelihood ratio test
:::

The output from `lmtest::lrtest()` shows the test statistic is $χ^2 = 1.56$.
::::
:::::

### NHST Step 3

Review and interpret the test statistic: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The test statistic of $χ^2 = 1.56$ has a `r glossary("p-value")` of .21.

### NHST Step 4

Conclude and write report.

::: {.callout #rep-chap10-lrtest}
##### Report the result of the likelihood ratio test

The null hypothesis was retained; the model with the interaction term was no different in explaining library use from the model without the interaction term ($χ^2 = 1.56; p = .21$).
:::

When the larger model is not statistically significantly better, use of the smaller model to aid interpretation is preferred. The more complex a model becomes, the more difficult it is to interpret. Generally speaking, parsimony is preferable.
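(As a cross-check of @lst-chap10-lr-test, the statistic can be recomputed from the two log-likelihoods. A minimal sketch, assuming `glm10.3` and `glm10.4` from above.)

```r
# a sketch: LR statistic = 2 * (logLik of larger - logLik of smaller)
lr_stat <- as.numeric(2 * (logLik(glm10.4) - logLik(glm10.3)))
lr_stat                                       # about 1.56
pchisq(lr_stat, df = 1, lower.tail = FALSE)   # p about .21
```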
There are exceptions to this preference for parsimony, however, such as when the larger model has variables in it that have been consistently related to the outcome in other research or are important to understanding the outcome for some other reason.

### Complete interpretation of the final model

::: {.callout #rep-chap10-final-interpretation}
##### Interpretation of the final model

**Model chosen**

A logistic regression model with age, sex, education, parent status, socioeconomic status, race-ethnicity, disability status, and rurality was statistically significantly better than a baseline model at explaining library use [$χ^2(12) = 94.74; p < .001$]. A likelihood ratio test comparing this model to a model that also included an interaction between sex and parent status showed that the larger model was not statistically significantly better than the smaller model [$χ^2(1) = 1.56; p = .21$], so the smaller model was retained.

**Odds of library use**

- The odds of library use were 51% lower for males compared to females ($OR = .49; 95\% CI: .39–.61$).
- The odds of library use were 1.90 times higher for those with a four-year degree compared to those with less than a high school education ($OR = 1.90; 95\% CI: 1.26–2.90$).
- The odds of library use were 1.55 times higher for non-Hispanic Black participants compared to Hispanic participants ($OR = 1.55; 95\% CI: 1.002–2.42$).
- The odds of library use were 1% lower for every 1-year increase in a person’s age ($OR = .99; 95\% CI: .984–.996$).
- The odds of library use were not statistically significantly different for urban or suburban residents compared to rural residents, for parents compared to nonparents, for non-Hispanic Whites compared to Hispanics, or for people with low or medium SES compared to high SES.

**Checked assumptions and model diagnostics**

Assumption checking revealed a possible problem with the linearity of the age predictor, especially at the youngest ages. The other assumptions were met.
Diagnostics found two problematic outlying or influential observations, but the observations did not appear to be data entry mistakes or much different from the rest of the sample in any way.
:::

:::::{.my-example}
:::{.my-example-header}
:::::: {#exm-chap10-forest-plot}
: Forest plot for reporting odds ratios
::::::
:::
::::{.my-example-container}

::: {.panel-tabset}

###### version 1

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-forest-plot1}
: Forest plot for reporting odds ratios (Version 1)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-forest-plot1}

```{r}
#| label: forest-plot1
# get odds ratio table from glm10.3
# not [4] as in the book, but [6]
df10.3_odds <- data.frame(odds.n.ends::odds.n.ends(mod = glm10.3)[6])

# change variable names for easier use
names(df10.3_odds) <- c("OR", "lower", "upper")

# make row names a variable
tbl10.3_odds <- df10.3_odds |>
  tibble::rownames_to_column(var = "variable")

# forest plot of odds ratios from glm10.3 (Figure 10.15)
tbl10.3_odds |>
  ggplot2::ggplot(ggplot2::aes(x = variable, y = OR,
                               ymin = lower, ymax = upper)) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(yintercept = 1, lty = 2,
                      color = "deeppink", linewidth = 1) +
  ggplot2::coord_flip() +
  ggplot2::labs(x = "Variable from library use model",
                y = "Odds ratio (95% CI)")
```

Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Version 1)
:::
::::
:::::

###### version 2

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-forest-plot2}
: Forest plot for reporting odds ratios (Version 2)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-forest-plot2}

```{r}
#| label: forest-plot2
# clean variable names for graph
tbl10.3_odds_clean <- tbl10.3_odds |>
  dplyr::mutate(variable = dplyr::recode(.x = variable,
    "sexmale" = "Male",
    "ruralityurban" = "Urban residence",
    "ruralitysuburban" = "Suburban residence",
    "parentparent" = "Parent",
    "educHS to 2-year degree" = "HS to 2-year degree",
    "educFour-year degree or more" = "Four-year degree or more",
    "disabledyes" = "Disabled",
    "age" = "Age",
    "seslow" = "Low socioeconomic status",
    "sesmedium" = "Medium socioeconomic status",
    "racethNon-Hispanic White" = "Non-Hispanic White",
    "racethNon-Hispanic Black" = "Non-Hispanic Black",
    "(Intercept)" = "Intercept"))

# modify graph to include clean variable names (Figure 10.16)
# change scale of y-axis (flipped) to log scale for visualization
tbl10.3_odds_clean |>
  ggplot2::ggplot(ggplot2::aes(x = variable, y = OR,
                               ymin = lower, ymax = upper)) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(yintercept = 1, lty = 2,
                      color = "deeppink", linewidth = 1) +
  ggplot2::scale_y_log10(breaks = c(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10),
                         minor_breaks = NULL) +
  ggplot2::coord_flip() +
  ggplot2::labs(x = "Variable from library use model",
                y = "Odds ratio (95% CI)")
```

Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Version 2)
:::
::::
:::::

###### final

:::::{.my-r-code}
:::{.my-r-code-header}
:::::: {#cnj-chap10-forest-plot3}
: Forest plot for reporting odds ratios (Final version)
::::::
:::
::::{.my-r-code-container}
::: {#lst-chap10-forest-plot3}

```{r}
#| label: forest-plot3
tbl10.3_odds_clean |>
  ggplot2::ggplot(ggplot2::aes(x = reorder(variable, OR), y = OR,
                               ymin = lower, ymax = upper)) +
  ggplot2::geom_pointrange(color = "#7463AC") +
  ggplot2::geom_hline(yintercept = 1, lty = 2,
                      color = "deeppink", linewidth = 1) +
  ggplot2::scale_y_log10(breaks = c(0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10),
                         minor_breaks = NULL) +
  ggplot2::coord_flip() +
  ggplot2::labs(x = "Variable from library use model",
                y = "Odds ratio (95% CI)")
```

Association between demographic characteristics and library use from Pew Research Center 2016 library use survey (Final version)
:::
::::
:::::

:::
::::
:::::

## Exercises (empty)

## Glossary

```{r}
#| label: glossary-table
#| echo: false
glossary_table()
```

------------------------------------------------------------------------

## Session Info {.unnumbered}

::: my-r-code
::: my-r-code-header
Session Info
:::
::: my-r-code-container

```{r}
#| label: session-info
sessioninfo::session_info()
```

:::
:::