11 Multinomial and ordinal logistic regression – Notes on Statistics with R (SwR)

11.1 Achievements to unlock

Objectives for chapter 11

SwR Achievements

Achievement 1: Using exploratory data analysis for multinomial logistic regression (Section 11.4)
Achievement 2: Estimating and interpreting a multinomial logistic regression model (Section 11.5)
Achievement 3: Checking assumptions for multinomial logistic regression (Section 11.6)
Achievement 4: Using exploratory data analysis for ordinal logistic regression (Section 11.7)
Achievement 5: Estimating and interpreting an ordinal logistic regression models (Section 11.8)
Achievement 6: Checking assumptions for ordinal logistic regression (Section 11.9)

Objectives 11.1: Achievements for chapter 11

11.2 The diversity dilemma in STEM

There is a lack of diversity in the science, technology, engineering, and math (STEM) fields and specifically about the lack of women. There are fewer women college graduates in computer science and math jobs now compared to 15 years ago.

There are three main reasons cited for fewer women in STEM:

beliefs about natural ability,
societal and cultural norms,
and institutional barriers.

One thing that appeared to encourage women in STEM is the visibility of other women in STEM careers.

11.3 Resources & Chapter Outline

11.3.1 Data, codebook, and R packages

Data, codebook, and R packages for learning about descriptive statistics

Data

Three options for accessing the data:

Download and save the original SAS file stem-nsf-2017-ch11.xpt from https://edge.sagepub.com/harris1e and run the code in the first code chunk to clean the data.
Download and save the original SAS file stem-nsf-2017-ch11.xpt from https://edge.sagepub.com/harris1e and follow the steps in Box 11.1 to clean the data.
Download and save the original 2017 National Survey of College Graduates data from the National Science Foundation’s SESTAT Data Tool (https://ncsesdata.nsf.gov/datadownload/) and follow Box 11.1 to clean the data.

As there is nothing new for me in the recoding procedures I will go for the first option.

Codebook

Two options for accessing the codebook:

Download the stem-nsf-2017-ch11-codebook.pdf from https://edge.sagepub.com/harris1e
Use the version that comes when downloading the raw data file from the National Science Foundation’s SESTAT Data Tool (https://ncsesdata.nsf.gov/datadownload/)

Packages

Packages used with the book (sorted alphabetically)

{Hmsic}: Section A.39 (Frank Harrell)
{MASS}: Section A.47 (Brian Ripley)
{mlogit}: Section A.48 (Yves Croissant)
{nnet}: Section A.55 (Brian Ripley)
{ordinal} Section A.60 (Rune Haubo Bojesen Christensen)
{scales} Section A.78 (Thomas Lin Pedersen) (not mentioned in the book)
{tableone} Section A.88 (Kazuki Yoshida)
{tidyverse}: Section A.93 (Hadley Wickham)

My additional packages (sorted alphabetically)

{dfidx}: Section A.16 (Yves Croissant)
{glue}: Section A.34 (Jennifer Bryan)
{haven}: Section A.38 (Hadley Wickham)
{skimr}: Section A.84 (Elin Waring)

11.3.2 Get & recode data

R Code 11.1 : Get and recode data for chapter 11

Code

## using zip file because of GitHub file limit of 100 MB
tbl11 <- utils::unzip(
    zipfile = "data/chap11/stem-nsf-2017-ch11.xpt.zip", 
    files = "stem-nsf-2017-ch11.xpt"
    ) |> 
    Hmisc::sasxport.get()


# function to recode the satisfaction variables
RecSatis <- function(x){
  return(forcats::fct_recode(x,
                "Very satisfied" = "1" ,
                "Somewhat satisfied" = "2",
                "Somewhat dissatisfied" = "3",
                "Very dissatisfied" = "4",
                NULL = "L")
        )
}

# recode and rename
tbl11.1 <- tbl11  |> 
  dplyr::select(n2ocprmg, satadv, satsal, satsoc, gender, age)  |> 
  dplyr::mutate(job_cat = forcats::as_factor(
                    forcats::fct_recode(.f = n2ocprmg,
                          "CS, Math, Eng" = "1",
                          "Other Sciences" = "2",
                          "Other Sciences" = "3",
                          "Other Sciences" = "4",
                          "CS, Math, Eng" = "5",
                          NULL = "6",
                          "Nonscience" = "7",
                          NULL  = "8"
                          )
                    )
                ) |> 
  dplyr::mutate(satis_advance = RecSatis(x = satadv))  |> 
  dplyr::mutate(satis_salary = RecSatis(x = satsal))  |> 
  dplyr::mutate(satis_contrib = RecSatis(x = satsoc))  |> 
  dplyr::mutate(sex = dplyr::recode(.x = gender, "M" = "Male", "F"= "Female"))  |> 
  dplyr::mutate(sex = forcats::fct_relevel(.f = sex, c("Male", "Female"))) |> 
  dplyr::mutate(age = as.numeric(x = age))  |> 
  dplyr::select(-n2ocprmg, -satadv, -satsal, -satsoc, -gender) |> 
  haven::zap_label()

# # make sure the reordering worked
# # re-order to have male first for ref group
# print(levels(x = tbl11.1$sex))

save_data_file("chap11", tbl11.1, "tbl11.1.rds")

Listing / Output 11.1: Get and recode data for chapter 11

(For this R code chunk is no output available)

At first I thought that there are no new things in recoding the data for chapter 11. But it turned out that there are some issues to report:

Using haven::read_xpt() instead of Hmisc::sasxport.get()

This was an error in two respects:

The variable names were not converted to lower case. So I had to change all variable names in the following recoding.
The {haven} function was extremely slow! In contrast to sasxport.get() with 3.94 seconds the file export took 104.12 second, e.g. more than 26 times longer!

I therefore returned to the much fast solution with Hmisc::sasxport.get().

Labelled data

I got with the export of a SAS transport file a labelled data frame with very long labels. After the recoding I lost many of these labels except of two columns. Therefore I used haven::zap_label() to remove all variable labels.

The original data file is too big for GitHub

After I committed to GitHub I got an error, because the file stem-nsf-2017-ch11.xpt was too large. I compressed it as .zip-file and changed the import code slightly to adapt this change. After testing that it worked I deleted the uncompressed file.

Using the {forcats} package

I changed the factor levels in the RecSatis() function with the {forcats} package.

11.3.3 Show raw data

Example 11.1 : Show summary of recoded data for chapter 11

tbl11.1
sample

R Code 11.2 : Show recoded data for chapter 11 (tbl11.1)

Code

tbl11.1 <- base::readRDS("data/chap11/tbl11.1.rds")

glue::glue("********** Summarizing with base:summary() **************")
base::summary(tbl11.1)

glue::glue("  ")
glue::glue("  ")
glue::glue("********** Summarizing with skimr::skim() **************")
glue::glue("  ")
skimr::skim(tbl11.1)

#> ********** Summarizing with base:summary() **************
#>       age                  job_cat                    satis_advance  
#>  Min.   :20.00   CS, Math, Eng :17223   Very satisfied       :16980  
#>  1st Qu.:32.00   Other Sciences: 7642   Somewhat satisfied   :30443  
#>  Median :44.00   Nonscience    :31043   Somewhat dissatisfied:15632  
#>  Mean   :45.53   NA's          :27764   Very dissatisfied    : 6576  
#>  3rd Qu.:58.00                          NA's                 :14041  
#>  Max.   :75.00                                                       
#>                 satis_salary                 satis_contrib       sex       
#>  Very satisfied       :21073   Very satisfied       :35124   Male  :45470  
#>  Somewhat satisfied   :34453   Somewhat satisfied   :25341   Female:38202  
#>  Somewhat dissatisfied: 9854   Somewhat dissatisfied: 6774                 
#>  Very dissatisfied    : 4251   Very dissatisfied    : 2392                 
#>  NA's                 :14041   NA's                 :14041                 
#>                                                                            
#>   
#>   
#> ********** Summarizing with skimr::skim() **************
#>

Data summary
Name	tbl11.1
Number of rows	83672
Number of columns	6
_______________________
Column type frequency:
factor	5
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
job_cat	27764	0.67	FALSE	3	Non: 31043, CS,: 17223, Oth: 7642
satis_advance	14041	0.83	FALSE	4	Som: 30443, Ver: 16980, Som: 15632, Ver: 6576
satis_salary	14041	0.83	FALSE	4	Som: 34453, Ver: 21073, Som: 9854, Ver: 4251
satis_contrib	14041	0.83	FALSE	4	Ver: 35124, Som: 25341, Som: 6774, Ver: 2392
sex	0	1.00	FALSE	2	Mal: 45470, Fem: 38202

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	0	1	45.53	14.3	20	32	44	58	75	▆▇▆▆▃

Listing / Output 11.2: Show recoded data for chapter 11 (tbl11.1)

job_cat: Job caktegory of current job. n2ocprmg was the original variable name. Recoded into three categories:
- CS, Math, Eng = Computer science, math, and engineering fields
- Other sciences = Other science fields
- Nonscience = Not a science field
satis_advance: Satisfaction with advancement opportunity. satadv was the original variable name. 4-point Likert scale from 4 = very dissatisfied to 1 = very satisfied.
satis_salary: Satisfaction with salary. satsal was the original variable name. 4-point Likert scale from 4 = very dissatisfied to 1 = very satisfied.
satis.contrib: Satisfaction with contribution to society. satsoc was the original variable name. 4-point Likert scale from 4 = very dissatisfied to 1 = very satisfied
sex: gender was the original variable name. Two categories: Female, Male
age: Age in years, not recoded or renamed

R Code 11.3 : Sample 1500 cases: 500 from each job category

Code

tbl11.1 <- base::readRDS("data/chap11/tbl11.1.rds")

base::set.seed(seed = 143)
# take a sample of 1500 cases
# 500 from each job.cat category
tbl11.2 <- tbl11.1 |> 
  tidyr::drop_na(job_cat) |> 
  dplyr::group_by(job_cat)  |> 
  dplyr::slice_sample(n = 500)

save_data_file("chap11", tbl11.2, "tbl11.2.rds")

glue::glue("********** Summarizing with base:summary() **************")
base::summary(tbl11.2)

glue::glue(" ")
glue::glue(" ")
glue::glue("********** Summarizing with skimr::skim() **************")
glue::glue(" ")
skimr::skim(tbl11.2)

#> ********** Summarizing with base:summary() **************
#>       age                  job_cat                  satis_advance
#>  Min.   :23.00   CS, Math, Eng :500   Very satisfied       :342  
#>  1st Qu.:32.00   Other Sciences:500   Somewhat satisfied   :646  
#>  Median :40.00   Nonscience    :500   Somewhat dissatisfied:361  
#>  Mean   :42.89                        Very dissatisfied    :151  
#>  3rd Qu.:54.00                                                   
#>  Max.   :75.00                                                   
#>                 satis_salary               satis_contrib     sex     
#>  Very satisfied       :420   Very satisfied       :732   Male  :911  
#>  Somewhat satisfied   :763   Somewhat satisfied   :570   Female:589  
#>  Somewhat dissatisfied:225   Somewhat dissatisfied:154               
#>  Very dissatisfied    : 92   Very dissatisfied    : 44               
#>                                                                      
#>                                                                      
#>  
#>  
#> ********** Summarizing with skimr::skim() **************
#>

Data summary
Name	tbl11.2
Number of rows	1500
Number of columns	6
_______________________
Column type frequency:
factor	4
numeric	1
________________________
Group variables	job_cat

Variable type: factor

skim_variable	job_cat	complete_rate	ordered	n_unique	top_counts
satis_advance	CS, Math, Eng	1	FALSE	4	Som: 227, Ver: 115, Som: 115, Ver: 43
satis_advance	Other Sciences	1	FALSE	4	Som: 213, Som: 131, Ver: 109, Ver: 47
satis_advance	Nonscience	1	FALSE	4	Som: 206, Ver: 118, Som: 115, Ver: 61
satis_salary	CS, Math, Eng	1	FALSE	4	Som: 254, Ver: 163, Som: 68, Ver: 15
satis_salary	Other Sciences	1	FALSE	4	Som: 257, Ver: 128, Som: 79, Ver: 36
satis_salary	Nonscience	1	FALSE	4	Som: 252, Ver: 129, Som: 78, Ver: 41
satis_contrib	CS, Math, Eng	1	FALSE	4	Som: 230, Ver: 193, Som: 61, Ver: 16
satis_contrib	Other Sciences	1	FALSE	4	Ver: 297, Som: 163, Som: 34, Ver: 6
satis_contrib	Nonscience	1	FALSE	4	Ver: 242, Som: 177, Som: 59, Ver: 22
sex	CS, Math, Eng	1	FALSE	2	Mal: 389, Fem: 111
sex	Other Sciences	1	FALSE	2	Mal: 263, Fem: 237
sex	Nonscience	1	FALSE	2	Mal: 259, Fem: 241

Variable type: numeric

skim_variable	job_cat	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	CS, Math, Eng	1	41.00	12.39	23	30.75	37	52	74	▇▅▃▃▁
age	Other Sciences	1	43.08	13.26	23	32.00	40	55	73	▇▆▅▅▂
age	Nonscience	1	44.60	13.18	23	33.00	43	55	75	▇▇▆▆▂

Listing / Output 11.3: Show sampled data: 500 from each job category

11.4 Achievement 1: EDA for multinomial logistic regression

11.4.1 Visualizing employment

Example 11.2 : Visualizing employment in computer science, math, and engineering by sex and age

R Code 11.4 : Plotting distribution of sex within job type

Listing / Output 11.4: Distribution of sex within job type among 1,500 college graduates in 2017

ggplot2::after_stat() replaces the old approach surrounding the variable names with .., e.g. ..prop... {ggplot2} throws a warning:

#> Warning: The dot-dot notation (..prop..) was deprecated in ggplot2 3.4.0.

#> ℹ Please use after_stat(prop) instead.

At first I had problems, because I used y = ggplot2::after_stat(count/sum(count)) inside the ggplot2::aes() function. This calculated the percentage over all different categories and not within each job category. The I learned with The ‘computed variables’ section in each stat lists which variables are available to access. that {ggplot2} computes with the ggplot2::stat_count() function for bar charts also groupwise proportion with ggplot2::after_stat(prop).

Resource 11.1 : How to use ggplot2::after_stat()?

To learn how to use the ggplot2::after_stat() function:

Read the help page to understand the differences between the different stages of mapping (direct input, after_stat() and after_scale()). Very important is the sentence: The ‘computed variables’ section in each stat lists which variables are available to access.
Read the short article Using after_stat() in {ggplot2} to use after_stat() to show percentages in the bar chart.
The first article of the very extensive series of articles going into many technical detail working — at least for me — as an eye opener how {ggplot2} works Demystifying delayed aesthetic evaluation: Part 1.
Hadley Wickham is currently preparing online the third edition of ggplot2: Elegant Graphics for Data Analysis which also has many details about the after_stat() function, for instance in 13 Build a plot layer by layer.

R Code 11.5 : Distribution of job type by sex

Listing / Output 11.5: Distribution of job type by sex

R Code 11.6 : Distribution of job type and age

Listing / Output 11.6: Distribution of job type and age

R Code 11.7 : Distribution of jobtype by age and sex

Listing / Output 11.7: Distribution of job type by age and sex

R Code 11.8 : Distribution by job type sex and age

Listing / Output 11.8: Distribution by job type sex and age

Report 11.1: Summary of the several plots about employment

Listing / Output 11.4: Computer science, math, and engineering have about a third as many females as males, other sciences and non-science were slightly more male than female.
Listing / Output 11.5: Computer science, math, and engineering jobs were the least common for females while this category was the largest for males.
Listing / Output 11.6: While the age range for all the data appeared similar across the three job types, the computer science, math, and engineering field employed the youngest people on average.
Listing / Output 11.7: In all three fields, the distribution of age showed that males have an older median age than females, and in the two science fields, the range of age is wider for males than females.
Listing / Output 11.8: The lowest median age for females is in computer science, math, and engineering and higher in other sciences and non-science. The age distribution for males showed a similar pattern across the three job types. Computer science, math, and engineering has the youngest median age for both sexes.

11.4.2 Checking bivariate statistical associations

Example 11.3 : Checking bivariate statistical associations between job type, sex, and age

age distribution
statistics

R Code 11.9 : Age distribution by job type among 1,500 college graduates in 2017

Listing / Output 11.9: Age distribution by job type among 1,500 college graduates in 2017

The histograms were not normally distributed for any of the three groups.

R Code 11.10 : Table of statistics to examine job_cat

Code

# make a table of statistics to examine job.cat
table_desc <- tableone::CreateTableOne(
  data = tbl11.2, 
  strata = 'job_cat',
  vars = c('sex', 'age')
  )

base::print(table_desc, 
            showAllLevels = TRUE, 
            nonnormal = 'age'
            )

#>                     Stratified by job_cat
#>                      level  CS, Math, Eng        Other Sciences      
#>   n                           500                  500               
#>   sex (%)            Male     389 (77.8)           263 (52.6)        
#>                      Female   111 (22.2)           237 (47.4)        
#>   age (median [IQR])        37.00 [30.75, 52.00] 40.00 [32.00, 55.00]
#>                     Stratified by job_cat
#>                      Nonscience           p      test   
#>   n                    500                              
#>   sex (%)              259 (51.8)         <0.001        
#>                        241 (48.2)                       
#>   age (median [IQR]) 43.00 [33.00, 55.00] <0.001 nonnorm

Listing / Output 11.10: Table of statistics to examine job_cat

The visual differences in the graphs corresponded to statistically significant differences from the chi-squared and Kruskal-Wallis tests. The median age for college graduates in computer science, math, or engineering was 3 years lower than the median age in other sciences and 6 years younger than the median age in non-science careers. Computer science, math, and engineering has more than three times as many males as females.

11.5 Achievement 2: Estimating a multinomial logistic regression model

11.5.1 Introduction

Bullet List 11.1

: Important to report with every model

Model significance: Is the model significantly better than some baseline at explaining the outcome?
Model fit: How well does the model capture the relationships in the underlying data?
Predictor values and significance: What is the size, direction, and significance of the relationship between each predictor and the outcome?
Checking model assumptions: Are the model assumptions met?

11.5.2 Check reference groups

Before starting with the computation for the multinomial logistic regression it is practicable to check the reference groups with base::levels(). The results are easier to interpret if the reference groups (= the first level) is consistent with what one is interested in. Otherwise use stats::relevel() or one of the many functions in {forcats}, for instance fct_recode() or fct_relevel().

R Code 11.11 : Check reference groups for regression

Code

base::levels(tbl11.2$job_cat)
base::levels(tbl11.2$sex)

#> [1] "CS, Math, Eng"  "Other Sciences" "Nonscience"    
#> [1] "Male"   "Female"

Listing / Output 11.11: Reference group for job_cat (= “CS, Math, Eng”) and sex (= “Male”)

The reference group are fine; no change is necessary.

11.5.3 NHST Step 1

Write the null and alternate hypotheses:

Wording for H0 and HA

H0: A multinomial model including sex and age is not useful in explaining or predicting job type for college graduates.
HA: A multinomial model including sex and age is useful in explaining or predicting job type for college graduates.

11.5.4 NHST Step 2

A Google search revealed the most of the R tutorials use the {nnet} package recommended also by SwR. A newer approach uses the meta-package {tidymodel} (similar to tidyverse) which collects 22 packages for modeling, containing {parsnip} with the multinom_reg() function that ca fit different classification models, including {nnet} as default package.

TODO: Learn {tidymodels}

I already learned from the {tidymodels} approach three-four years ago. But at that time I thought that it is too advanced for my skill level. This has changed now. I have worked through all of the three pre-requisites mentioned on the start page of tidymodels. And now — coming to the end of SwR — I understand why this unification project is important.

TODO 11.1: Learn {tidymodels}

Example 11.4 : Computing a multinomial logical regression

R Code 11.12 : Estimate model and print summary

Code

# estimate the model and print its summary
mnm11.1 <- nnet::multinom(
  formula = job_cat ~ age + sex + age*sex,
  data = tbl11.2,
  model = TRUE)

#> # weights:  15 (8 variable)
#> initial  value 1647.918433 
#> iter  10 value 1587.596983
#> final  value 1585.536069 
#> converged

Code

save_data_file("chap11", mnm11.1, "mnm11.1.rds")

summary(object = mnm11.1)

#> Call:
#> nnet::multinom(formula = job_cat ~ age + sex + age * sex, data = tbl11.2, 
#>     model = TRUE)
#> 
#> Coefficients:
#>                (Intercept)        age sexFemale age:sexFemale
#> Other Sciences   -1.108879 0.01671124  1.190186  5.013094e-05
#> Nonscience       -1.659125 0.02846208  1.521268 -6.082880e-03
#> 
#> Std. Errors:
#>                (Intercept)         age sexFemale age:sexFemale
#> Other Sciences   0.2724993 0.006032186 0.4968062    0.01166812
#> Nonscience       0.2803308 0.006040267 0.5010979    0.01162522
#> 
#> Residual Deviance: 3171.072 
#> AIC: 3187.072

Listing / Output 11.12: Computing a multinomial regression model for job types by age and sex with interaction

R Code 11.13 : Estimate null model

Code

# multinomial null model
mnm11.0 <- nnet::multinom(
  formula = job_cat ~ 1,
  data = tbl11.2,
  model = TRUE)

#> # weights:  6 (2 variable)
#> initial  value 1647.918433 
#> final  value 1647.918433 
#> converged

Code

summary(object = mnm11.0)

#> Call:
#> nnet::multinom(formula = job_cat ~ 1, data = tbl11.2, model = TRUE)
#> 
#> Coefficients:
#>                  (Intercept)
#> Other Sciences -1.081690e-12
#> Nonscience      2.222222e-12
#> 
#> Std. Errors:
#>                (Intercept)
#> Other Sciences  0.06324555
#> Nonscience      0.06324555
#> 
#> Residual Deviance: 3295.837 
#> AIC: 3299.837

Listing / Output 11.13: Multinomial null model

R Code 11.14 : Compute test statistics for the multinomial model mnm11.1

Code

# get the job model chi-squared
job_chisq <- mnm11.0$deviance - mnm11.1$deviance

# get the degrees of freedom for chi-squared
job_df <- length(x = summary(object = mnm11.1)$coefficients) - 
  length(x = summary(object = mnm11.0)$coefficients)

# get the p-value for chi-squared
job_p <- stats::pchisq(
  q = job_chisq, 
  df = job_df, 
  lower.tail = FALSE
  )

# put together into a vector and round to 3 decimal places
model_sig <- base::round(x = c(job_chisq, job_df, job_p), 3)

# add names to the vector
base::names(x = model_sig) <- c("Chi-squared", "df", "p")

# print the vector
model_sig

#> Chi-squared          df           p 
#>     124.765       6.000       0.000

Listing / Output 11.14: Test statistics for the multinomial model

Finishing step 2 of the NHST procedure:

Compute the test statistics.

The test statistic is a chi-squared of 124.77 with 6 degrees of freedom.

11.5.5 NHST Step 3

Review and interpret the test statistics:

Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The probability computed for the chi-squared was < .001.

11.5.6 NHST Step 4

Conclude and write report.

Report 11.2: Report the conclusion of the interpretation of the multinomial model mnm11.1

We have to reject the null hypothesis and conclude that a model including age, sex, and age*sex explained job type statistically significantly better [$χ^2(6) = 124.77; p < .001$] than a null model with no predictors.

11.5.7 Multinomial model fit

Example 11.5 : Multinomial model fit

fit & predict
observed & predicted

R Code 11.15 : Show fitted.values and predict() results

Code

df1 <- my_glance_data(data.frame(mnm11.1$fitted.values))
df2 <- my_glance_data(data.frame(predicted = stats::predict(mnm11.1)))
dplyr::full_join(df1, df2, by = dplyr::join_by(obs))

#>     obs CS..Math..Eng Other.Sciences Nonscience      predicted
#> 1     1     0.4029162      0.2964827  0.3006010  CS, Math, Eng
#> 2    49     0.3705839      0.3014510  0.3279652  CS, Math, Eng
#> 3   321     0.4855538      0.2780725  0.2363737  CS, Math, Eng
#> 4   561     0.2156303      0.3999086  0.3844611 Other Sciences
#> 5   634     0.1964633      0.4029106  0.4006260 Other Sciences
#> 6  1098     0.3186954      0.3063952  0.3749094     Nonscience
#> 7  1170     0.1995693      0.4024774  0.3979533 Other Sciences
#> 8  1177     0.3547166      0.3033783  0.3419051  CS, Math, Eng
#> 9  1252     0.2027106      0.4020175  0.3952718 Other Sciences
#> 10 1500     0.1393026      0.4062125  0.4544848     Nonscience

Listing / Output 11.15: Random example rows of fitted.values to compare with the results from the predict() function

obs column: Number of row taken from the sample data tbl11.2.
The next column are the predicted probability for each person to be in each of the three job type groups. The data are taken from the fitted-values result of the multinomial model mnm11.1.
The last column predict the job category for each randomly chosen person. This is always the highest probability from the fitted.values categories.

R Code 11.16 : Compare observed job category with the predicted value

Code

# observed vs. predicted category for each observation
fit_abs <- base::table(
  observed = mnm11.1$model$job_cat,
  predicted = stats::predict(object = mnm11.1))

glue::glue("Absolute values: observed vs. predicted category for each observation")
fit_abs

# observed vs. predicted category for each observation
fit_perc <- base::proportions(
  base::table(
    observed = mnm11.1$model$job_cat,
    predicted = predict(object = mnm11.1)
            ),
  margin = 1
)

glue::glue(" ")
glue::glue(" ")
glue::glue("Percentages: observed vs. predicted category for each observation")
fit_perc

#> Absolute values: observed vs. predicted category for each observation
#>                 predicted
#> observed         CS, Math, Eng Other Sciences Nonscience
#>   CS, Math, Eng            342             66         92
#>   Other Sciences           206            129        165
#>   Nonscience               201            112        187
#>  
#>  
#> Percentages: observed vs. predicted category for each observation
#>                 predicted
#> observed         CS, Math, Eng Other Sciences Nonscience
#>   CS, Math, Eng          0.684          0.132      0.184
#>   Other Sciences         0.412          0.258      0.330
#>   Nonscience             0.402          0.224      0.374

Listing / Output 11.16: Observed job category versus the predicted value of the model (absolute and percentages)

The model was best at predicting the computer science, math, or engineering job type, with 68.4% of the observations in this category correctly predicted. The nonscience job type was predicted with 37.4% accuracy, and other sciences was predicted with 25.8% accuracy. Overall, the model predicted job type correctly for 658 out of 1,500 observations (43.9%). Given that the sample included 500 people from each job type, without any other information, we would guess that everyone was in a single category and would be right for 500, or 33.3%, of the observations. Although the overall correctness was low, it was higher than this baseline probability of correctly classifying observations by job type.

11.5.8 Multinomial model predictor interpretation

We know that the model is statistically significantly better than baseline at explaining job type and that the predictions from the model were better than a baseline percentage would be. We also know that the prediction was best for computer science, math, and engineering. The next thing we want to examine is the predictor fit and significance.

11.5.8.1 Predictor significance

Listing / Output 11.12 does not include any indication of whether or not each predictor (age, sex, age*sex) is statistically significantly associated with job type. To get statistical significance and more interpretable values, we need to compute odds ratios and 95% confidence intervals around the odds ratios.

Example 11.6 : Predictor significance

R Code 11.17 : Get odds ratios

Code

# get odds ratios and transpose to get 'standard' format
base::t(x = base::exp(x = stats::coef(object = mnm11.1)))

#>               Other Sciences Nonscience
#> (Intercept)        0.3299285  0.1903055
#> age                1.0168517  1.0288710
#> sexFemale          3.2876930  4.5780248
#> age:sexFemale      1.0000501  0.9939356

Listing / Output 11.17: Odds ratios

Caution 11.1: Watch out for the correct reference group

It gets more complicated with this type of regression. Not only do the predictors have reference groups, but the outcome variable also has a reference group.

The reference group for the outcome variable is computer science, math, or engineering. This was not only our result with Listing / Output 11.11 but it is confirmed by missing this group in the result. (The missing group is always the reference group.)

Report 11.3: Example of an interpretation of the odds ratios

For every year older a person gets, the odds of having a career in other sciences is 1.02 times higher compared to the odds of being in computer science, math, or engineering.

R Code 11.18 : Compute confidence intervals

Code

# confidence intervals for odds ratios
base::exp(x = stats::confint(object = mnm11.1))

#> , , Other Sciences
#> 
#>                   2.5 %    97.5 %
#> (Intercept)   0.1934052 0.5628228
#> age           1.0049003 1.0289451
#> sexFemale     1.2416783 8.7050932
#> age:sexFemale 0.9774394 1.0231839
#> 
#> , , Nonscience
#> 
#>                   2.5 %     97.5 %
#> (Intercept)   0.1098584  0.3296623
#> age           1.0167623  1.0411239
#> sexFemale     1.7145211 12.2240033
#> age:sexFemale 0.9715448  1.0168424

Listing / Output 11.18: Confidence intervals for odds ratios

Remember: A confidence interval of odds ratio is statistically significant if it does not include 1. This is the case for age and Female sex but not for the interaction term age:sexFemale.

R Code 11.19 : Put odd ratios and confidence interval together into the same data table

Code

# get odds ratios for other sciences from the model object
oddsratio_other_sci <- 
  base::t(x = base::exp(x = stats::coef(object = mnm11.1)))[ , 1]

# get CI for other sciences
confint_other_sci <- 
  base::exp(x = stats::confint(object = mnm11.1))[ , 1:2, 1]

# put into a data frame 
other_sci <- base::data.frame(
  oddsratio_other = oddsratio_other_sci,
  ci_other = confint_other_sci)

# get odds ratios for non-science
oddsratio_non_sci <- 
  base::t(x = base::exp(x = stats::coef(object = mnm11.1)))[ , 2]

# get CI for non-science
confint_non_sci <- 
  base::exp(x = stats::confint(object = mnm11.1))[ , 1:2, 2]

# put into a data frame
non_sci <- base::data.frame(
  oddsratio_non = oddsratio_non_sci,
  ci_non = confint_non_sci)

# all together
data.frame(other_sci, non_sci)

#>               oddsratio_other ci_other.2.5.. ci_other.97.5.. oddsratio_non
#> (Intercept)         0.3299285      0.1934052       0.5628228     0.1903055
#> age                 1.0168517      1.0049003       1.0289451     1.0288710
#> sexFemale           3.2876930      1.2416783       8.7050932     4.5780248
#> age:sexFemale       1.0000501      0.9774394       1.0231839     0.9939356
#>               ci_non.2.5.. ci_non.97.5..
#> (Intercept)      0.1098584     0.3296623
#> age              1.0167623     1.0411239
#> sexFemale        1.7145211    12.2240033
#> age:sexFemale    0.9715448     1.0168424

Listing / Output 11.19: Odd ratios and confidence intervals of the multinomial model mnm11.1

The first three columns of numbers are the odds ratios and confidence intervals for the job type of other sciences, while columns 4 through 6 are for the non-science job type.

11.5.8.2 Predictor interpretation

The outcome variable with multiple categories now had a reference group. In this case, the reference group is computer science, math, or engineering job type. The odds ratios are interpreted with respect to this reference group. This works OK for the continuous variable of age but gets a little tricky for the categorical variable of sex, where there are now two reference groups to consider.

Report 11.4: Predictor interpretation of the multinomial logistic regression model mnm11.1

The age row starts with the odds ratio of 1.02 with confidence interval 1.00–1.03. For every 1-year increase in age, the odds of being in an other sciences job are 1.02 times higher than being in a computer science, math, or engineering job ($95% CI: 1.00–1.03$). Likewise, for every 1-year increase in age, the odds of being in a non-science job are 1.03 times higher than being in a computer science, math, or engineering job ($95% CI: 1.02–1.04$).

Compared to males, the odds of females being in an other sciences job are 3.29 times higher than being in a computer science, math, or engineering job ($95% CI: 1.24–8.71$). Also, compared to males, females have 4.58 times higher odds of being in non-science jobs compared to computer science, math, or engineering jobs ($95% CI: 1.71–12.22$).

The interaction between age and sex was not statistically significant for either other sciences jobs or non-science jobs compared to computer science, math, or engineering jobs.

Overall, it seems that females had higher odds than males of being in other sciences or nonscience compared to being in computer science, math, or engineering. Likewise, the older someone gets, the more likely they are to work in other sciences or non-science compared to computer science, math, or engineering. It seems that, overall, computer science, math, or engineering job types are most likely to be held by males and people who are younger.

11.6 Achievement 3: Checking assumptions for multinomial logistic regression

Besides the assumption of independence of observations (which is met), there is the IIA assumption. The independence of irrelevant alternatives assumption requires that the categories of the outcome be independent of one another. For example, the relative probability of having a job in the computer science, math, or engineering field compared to the other science field cannot depend on the existence of non-science jobs.

This assumption can be tested by taking a subset of the data that includes two of the outcome categories and estimating binary logistic regression models with the same predictors. These model results can then be compared with the multinomial model results to see if the relationship is consistent. The test used for this assumption is the Hausman-McFadden test.

Remark 11.1. : What is the independence of irrelevant alternatives (IIA) assumption?

The above explanation of IIA in SwR is easy to misunderstand. The important clarification is that the third alternative (C) is not an alternative of the choice between A and B. Therefore we are talking of irrelevant alternatives. So what we state is that the probability of choosing A or B is independent of C. The following quote from Statistical Horizons brings an example that helped my understanding more than the example in SwR:

In discrete choice theory, the IIA assumption says that when people are asked to choose among a set of alternatives, their odds of choosing A over B should not depend on whether some other alternative C is present or absent. As an example, consider the 1992 U.S. presidential election. The two major party candidates were Bill Clinton and George H. W. Bush. But H. Ross Perot was also on the ballot in all 50 states. The IIA assumption says that, for each voter, the odds of choosing Clinton over Bush was not affected by Perot’s presence on the ballot. (Allison 2012)

The test strategy is the following: for each alternative, delete individuals who chose that alternative and re-estimate the model for the remaining alternatives; then construct a test comparing the new estimates with the original estimates.

In our voting example, for instance, we could exclude the people who voted for Perot, and estimate a binary logit model predicting Clinton vs. Bush. We could then test whether the binary coefficients were the same as the multinomial coefficients. We could also exclude the people who voted for Clinton and re-estimate the second equation by binary logit. Again, we could compare the binary coefficients with the multinomial coefficients. And, finally, we could redo the test with Bush as the excluded category.

Example 11.7 : Compute the Hausman-McFadden test for independence of irrelevant alternatives

R Code 11.20 : Re-estimate the model with {mlogit}

Code

# reshape data to use with the dfidx function from {dfidx}

tbl11.2 <- base::readRDS("data/chap11/tbl11.2.rds")

tbl11.2_wide <- dfidx::dfidx(data = tbl11.2,
                               choice = "job_cat",
                               shape = "wide")

# estimate the model and print its summary
mnm3 <- mlogit::mlogit(formula = job_cat ~ 0 | age + sex + age*sex,
                     data = tbl11.2_wide)

summary(object = mnm3)

#> 
#> Call:
#> mlogit::mlogit(formula = job_cat ~ 0 | age + sex + age * sex, 
#>     data = tbl11.2_wide, method = "nr")
#> 
#> Frequencies of alternatives:choice
#>  CS, Math, Eng     Nonscience Other Sciences 
#>        0.33333        0.33333        0.33333 
#> 
#> nr method
#> 4 iterations, 0h:0m:0s 
#> g'(-H)^-1g = 3.03E-07 
#> gradient close to zero 
#> 
#> Coefficients :
#>                               Estimate  Std. Error z-value  Pr(>|z|)    
#> (Intercept):Nonscience     -1.6591e+00  2.8033e-01 -5.9184 3.250e-09 ***
#> (Intercept):Other Sciences -1.1089e+00  2.7250e-01 -4.0693 4.716e-05 ***
#> age:Nonscience              2.8462e-02  6.0403e-03  4.7121 2.452e-06 ***
#> age:Other Sciences          1.6711e-02  6.0322e-03  2.7703  0.005600 ** 
#> sexFemale:Nonscience        1.5212e+00  5.0110e-01  3.0358  0.002399 ** 
#> sexFemale:Other Sciences    1.1902e+00  4.9681e-01  2.3956  0.016592 *  
#> age:Nonscience             -6.0824e-03  1.1625e-02 -0.5232  0.600830    
#> age:Other Sciences          5.0832e-05  1.1668e-02  0.0044  0.996524    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Log-Likelihood: -1585.5
#> McFadden R^2:  0.037855 
#> Likelihood ratio test : chisq = 124.76 (p.value = < 2.22e-16)

Listing / Output 11.20: Model re-estimated with the {mlogit} package.

The function mlogit::mlogit.data() used in the book is deprecated and should be replaced with dfidx() from the {dfidx} package (see: Section A.16).

R Code 11.21 : Estimate the model with two outcome categories

Code

# estimate the model with two outcome categories and print its summary
mnm3_alt <- 
  mlogit::mlogit(formula = job_cat ~ 0 | age + sex + age*sex,
  data = tbl11.2_wide,
  alt.subset = c("CS, Math, Eng", "Nonscience"))

summary(object = mnm3_alt)

#> 
#> Call:
#> mlogit::mlogit(formula = job_cat ~ 0 | age + sex + age * sex, 
#>     data = tbl11.2_wide, alt.subset = c("CS, Math, Eng", "Nonscience"), 
#>     method = "nr")
#> 
#> Frequencies of alternatives:choice
#> CS, Math, Eng    Nonscience 
#>           0.5           0.5 
#> 
#> nr method
#> 4 iterations, 0h:0m:0s 
#> g'(-H)^-1g = 0.000204 
#> successive function values within tolerance limits 
#> 
#> Coefficients :
#>                          Estimate Std. Error z-value  Pr(>|z|)    
#> (Intercept):Nonscience -1.6914715  0.2849648 -5.9357 2.926e-09 ***
#> age:Nonscience          0.0291976  0.0061498  4.7478 2.057e-06 ***
#> sexFemale:Nonscience    1.5384978  0.5068185  3.0356  0.002401 ** 
#> age:Nonscience         -0.0064503  0.0117627 -0.5484  0.583441    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Log-Likelihood: -641.19
#> McFadden R^2:  0.074965 
#> Likelihood ratio test : chisq = 103.92 (p.value = < 2.22e-16)

Listing / Output 11.21: Model estimated with two outcome categories

R Code 11.22 : Hausman-McFadden test

Code

mlogit::hmftest(x = mnm3, z = mnm3_alt)

#> 
#>  Hausman-McFadden test
#> 
#> data:  tbl11.2_wide
#> chisq = 0.4851, df = 4, p-value = 0.9749
#> alternative hypothesis: IIA is rejected

Listing / Output 11.22: Hausman-McFadden test

The p-value was .97, which is much higher than .05, so the null hypothesis is retained and the IIA assumption is met.

11.6.1 Conclusion of the multinomial logistic regression

Report 11.5: Report of the multinomial logistic regression of job types versus age and sex (final version)

A multinomial logistic regression including age, sex, and the interaction between age and sex was statistically significantly better at predicting job type than a null model with no predictors [$χ^2(6) = 124.77; p < .001$]. The model met its assumptions and correctly predicted the job type for 43.9% of observations.

For every 1-year increase in age, the odds of being in an other sciences job are 1.02 times higher than being in a computer science, math, or engineering job (95% CI: 1.00–1.03).
Likewise, for every 1-year increase in age, the odds of being in a nonscience job are 1.03 times higher than being in a computer science, math, or engineering job (95% CI: 1.02–1.04).
Compared to males, the odds of females being in an other sciences job are 3.29 times higher than being in a computer science, math, or engineering job (95% CI: 1.24–8.71).
Also, compared to males, females have 4.58 times higher odds of being in nonscience jobs compared to computer science, math, or engineering jobs (95% CI: 1.71–12.22).
The interaction between age and sex was not a statistically significant indicator of job type.

11.7 Achievement 4: EDA for ordinal logistic regression

11.7.1 Introduction

We are not going to compare satisfaction of women to the satisfaction of men with computer science, math, or engineering. Out interest is in understanding whether job satisfaction differs for women across the three job types.

We will take a sample of female participants from the original data frame so that we would have a larger number in the sample than using the females from the subset of 1,500. Since there were very few people (relatively speaking) in the dissatisfied categories (see Listing / Output 11.3), we will take 250 people from each category of one of the satisfaction variables to ensure to get enough people in the dissatisfied categories.

Example 11.8 : Exploratory data analysis for a ordinal logistic regression models

R Code 11.23 : Sample of 1,000 females

Code

tbl11.1 <- base::readRDS("data/chap11/tbl11.1.rds")

base::RNGkind(sample.kind = "Rejection")
set.seed(seed = 143)

tbl11.3 <- tbl11.1  |> 
  tidyr::drop_na(job_cat)  |> 
  dplyr::filter(sex == "Female") |> 
  dplyr::group_by(satis_contrib)  |> 
  dplyr::slice_sample(n = 250)  |> 
  dplyr::ungroup()

save_data_file("chap11", tbl11.3, "tbl11.3.rds")

# check work
summary(object = tbl11.3)

#>       age                  job_cat                  satis_advance
#>  Min.   :23.00   CS, Math, Eng :177   Very satisfied       :160  
#>  1st Qu.:31.00   Other Sciences:133   Somewhat satisfied   :356  
#>  Median :37.00   Nonscience    :690   Somewhat dissatisfied:297  
#>  Mean   :40.51                        Very dissatisfied    :187  
#>  3rd Qu.:50.00                                                   
#>  Max.   :72.00                                                   
#>                 satis_salary               satis_contrib     sex      
#>  Very satisfied       :235   Very satisfied       :250   Male  :   0  
#>  Somewhat satisfied   :455   Somewhat satisfied   :250   Female:1000  
#>  Somewhat dissatisfied:184   Somewhat dissatisfied:250                
#>  Very dissatisfied    :126   Very dissatisfied    :250                
#>                                                                       
#>

Listing / Output 11.23: Sample of 1,000 females

Warning 11.1: Different sample figures with the same set.seed() command

I got different figures after data sampling. I do not know from where the difference comes from. I had tried exactly the same code (e.g., using the deprecated function sample_n()), also I changed the set.seed() command to base::RNGkind(sample.kind = "Rounding") to meet the randomized sampling before version R 3.6.0. But in every case I got different figures than in the book example.

R Code 11.24 : Job satisfaction by society contribution

Listing / Output 11.24: Job satisfaction by society contribution

R Code 11.25 : Job satisfaction by salary

Listing / Output 11.25: Job satisfaction by salary

R Code 11.26 : Job satisfaction by advance of science

Listing / Output 11.26: Job satisfaction by advance of science

R Code 11.27 : Job satisfaction with salary by age

Listing / Output 11.27: Job satisfaction with salary by age

R Code 11.28 : Create Likert-scale with ordered factors

Code

OrderSatis <- function(x){
  return(ordered(x, levels = c("Very dissatisfied",
                               "Somewhat dissatisfied",
                               "Somewhat satisfied",
                               "Very satisfied")
                 )
         )
}

# use function to order 3 satisfaction variables
tbl11.4 <- tbl11.3 |> 
  dplyr::mutate(satis_advance = OrderSatis(x = satis_advance))  |> 
  dplyr::mutate(satis_salary = OrderSatis(x = satis_salary))  |> 
  dplyr::mutate(satis_contrib = OrderSatis(x = satis_contrib))

save_data_file("chap11", tbl11.4, "tbl11.4.rds")

# check the data
skimr::skim(tbl11.4)

Data summary
Name	tbl11.4
Number of rows	1000
Number of columns	6
_______________________
Column type frequency:
factor	5
numeric	1
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
job_cat	1	FALSE	3	Non: 690, CS,: 177, Oth: 133
satis_advance	1	TRUE	4	Som: 356, Som: 297, Ver: 187, Ver: 160
satis_salary	1	TRUE	4	Som: 455, Ver: 235, Som: 184, Ver: 126
satis_contrib	1	TRUE	4	Ver: 250, Som: 250, Som: 250, Ver: 250
sex	1	FALSE	1	Fem: 1000, Mal: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	0	1	40.51	11.89	23	31	37	50	72	▇▇▅▃▁

Listing / Output 11.28: {skimr} shows that the satisfaction factors are now ordered factors.

With this change I had some troubles, because I haven’t had any experience how to do and also to check if the levels are ordered or not. The check in the book with base::summary does not reveal if the factors are ordered or not. The summary() reports the sequence of the levels but not if they are ordered.

You can’t use the function base::levels() neither, because it also prints out only the chronological sequence but not if this sequence consists of ordered factor levels.

Important 11.1: No possibility found in {forcats} to create factor with ordered levels

I could manipulate the (chronological) appearance of levels in many ways but I did not find an option to create the class "ordered" "factor".

Therefore I have to use one of the following base R options options:

base::as.ordered()
base::factor(x = c("A", "Z", "M"), levels = c("A", "B", "M", "Z"), ordered = TRUE)
base::ordered(c("A", "Z", "M"), level = c("Z", "M", "A"))

The following examples uses in the first line with tbl11.1$satis_advance an ordered factor, and with the second line (after the ‘—’ line) with tbl11.1$sex a factor without order.

R Code 11.29 : How to check the order of factor levels?

Code

tbl11.4 <- base::readRDS("data/chap11/tbl11.4.rds")

glue::glue("##########################################################################")
glue::glue("base::levels() does not reveal if theres is an order in the factor levels")
base::levels(tbl11.4$satis_advance)
glue::glue("--------------------------------------------------------------------------")
base::levels(tbl11.4$sex)

glue::glue(" ")
glue::glue("##########################################################################")
glue::glue("If you display a value of an ordered factor, it reveals the order with '<'")
head(tbl11.4$satis_advance, 1)
glue::glue(" ")
glue::glue("To prevent showing one value one can use forcats::fct_unique()")
forcats::fct_unique(tbl11.4$satis_advance)
glue::glue("--------------------------------------------------------------------------")
head(tbl11.4$sex, 1)


glue::glue("How to check if a factor has ordered levels, but not the order itself?")
glue::glue(" ")
glue::glue("##########################################################################")
glue::glue("utils::str()")
utils::str(tbl11.4)

glue::glue(" ")
glue::glue("##########################################################################")
glue::glue("base::class()")
base::class(tbl11.4$satis_advance)
glue::glue("--------------------------------------------------------------------------")
base::class(tbl11.4$sex)

glue::glue(" ")
glue::glue("##########################################################################")
glue::glue("base::is.ordered()")
base::is.ordered(tbl11.4$satis_advance)
glue::glue("--------------------------------------------------------------------------")
base::is.ordered(tbl11.4$sex)

glue::glue(" ")
glue::glue("##########################################################################")
glue::glue("skimr::skim()) has a column 'ordered' about the status of the factor variable")
tbl11_temp <- tbl11.4 |> dplyr::select(satis_advance, sex)
skimr::skim(tbl11_temp)

#> ##########################################################################
#> base::levels() does not reveal if theres is an order in the factor levels
#> [1] "Very dissatisfied"     "Somewhat dissatisfied" "Somewhat satisfied"   
#> [4] "Very satisfied"       
#> --------------------------------------------------------------------------
#> [1] "Male"   "Female"
#>  
#> ##########################################################################
#> If you display a value of an ordered factor, it reveals the order with '<'
#> [1] Very satisfied
#> 4 Levels: Very dissatisfied < Somewhat dissatisfied < ... < Very satisfied
#>  
#> To prevent showing one value one can use forcats::fct_unique()
#> [1] Very dissatisfied     Somewhat dissatisfied Somewhat satisfied   
#> [4] Very satisfied       
#> 4 Levels: Very dissatisfied < Somewhat dissatisfied < ... < Very satisfied
#> --------------------------------------------------------------------------
#> [1] Female
#> Levels: Male Female
#> How to check if a factor has ordered levels, but not the order itself?
#>  
#> ##########################################################################
#> utils::str()
#> tibble [1,000 × 6] (S3: tbl_df/tbl/data.frame)
#>  $ age          : num [1:1000] 35 27 32 64 67 31 56 42 55 53 ...
#>  $ job_cat      : Factor w/ 3 levels "CS, Math, Eng",..: 3 3 2 3 3 1 2 3 3 3 ...
#>  $ satis_advance: Ord.factor w/ 4 levels "Very dissatisfied"<..: 4 4 3 4 4 4 2 3 4 4 ...
#>  $ satis_salary : Ord.factor w/ 4 levels "Very dissatisfied"<..: 3 4 3 3 4 4 3 3 4 4 ...
#>  $ satis_contrib: Ord.factor w/ 4 levels "Very dissatisfied"<..: 4 4 4 4 4 4 4 4 4 4 ...
#>  $ sex          : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 2 2 2 2 ...
#>  
#> ##########################################################################
#> base::class()
#> [1] "ordered" "factor" 
#> --------------------------------------------------------------------------
#> [1] "factor"
#>  
#> ##########################################################################
#> base::is.ordered()
#> [1] TRUE
#> --------------------------------------------------------------------------
#> [1] FALSE
#>  
#> ##########################################################################
#> skimr::skim()) has a column 'ordered' about the status of the factor variable

Data summary
Name	tbl11_temp
Number of rows	1000
Number of columns	2
_______________________
Column type frequency:
factor	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
satis_advance	0	1	TRUE	4	Som: 356, Som: 297, Ver: 187, Ver: 160
sex	0	1	FALSE	1	Fem: 1000, Mal: 0

Listing / Output 11.29: Several options to check the order of factor levels

11.7.2 Examine satisfaction by job type for females

Example 11.9 : Examine satisfaction by job type for females with the Kruskal-Wallis test

salary
society contribution

R Code 11.30 : Examine salary satisfaction by job type for females with the Kruskal-Wallis test

Code

# examine salary satisfaction by job type for females
sat_salary_kw <- stats::kruskal.test(
  formula = as.integer(x = satis_salary) ~ job_cat,
  data = tbl11.4
  )
sat_salary_kw

#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  as.integer(x = satis_salary) by job_cat
#> Kruskal-Wallis chi-squared = 21.46, df = 2, p-value = 2.187e-05

Code

#examine salary satisfaction by age
sat_salary_rho <- cor.test(
  formula = ~ as.integer(x = satis_salary) + age,
  method = "spearman",
  data = tbl11.4)

#> Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
#> exact p-value with ties

Code

sat_salary_rho

#> 
#>  Spearman's rank correlation rho
#> 
#> data:  as.integer(x = satis_salary) and age
#> S = 154853493, p-value = 0.025
#> alternative hypothesis: true rho is not equal to 0
#> sample estimates:
#>        rho 
#> 0.07087811

Listing / Output 11.30: Salary satisfaction by job type for females

I got a warning, that the test cannot compute exact p-values with ties. I didn’t see this warning in the book example.

Remember that I got different figures after the sampling procedure, but all in all the results are the sames: The satisfaction with salary was significantly associated with job type but not with age.

Report 11.6: Salary satisfaction by job type for females

The satisfaction with salary was significantly associated with job type but not with age. Specifically, there was a statistically significant relationship between job type and satisfaction with salary $\chi^2_{KW}(2)21.46, p< 0.01$. In addition, there was a non-significant and weak positive correlation between age and salary satisfaction (rs = 0.07; p = 0.025).

R Code 11.31 : Examine society contribution satisfaction by job type for females with the Kruskal-Wallis test

Code

# examine society contribution satisfaction by job type for females
sat_contrib_kw <- stats::kruskal.test(
  formula = as.integer(x = satis_contrib) ~ job_cat,
  data = tbl11.4
  )
sat_contrib_kw

#> 
#>  Kruskal-Wallis rank sum test
#> 
#> data:  as.integer(x = satis_contrib) by job_cat
#> Kruskal-Wallis chi-squared = 10.523, df = 2, p-value = 0.005187

Code

#examine society contribution satisfaction by age
sat_contrib_rho <- cor.test(
  formula = ~ as.integer(x = satis_contrib) + age,
  method = "spearman",
  data = tbl11.4)

#> Warning in cor.test.default(x = mf[[1L]], y = mf[[2L]], ...): Cannot compute
#> exact p-value with ties

Code

sat_contrib_rho

#> 
#>  Spearman's rank correlation rho
#> 
#> data:  as.integer(x = satis_contrib) and age
#> S = 146366521, p-value = 0.0001128
#> alternative hypothesis: true rho is not equal to 0
#> sample estimates:
#>    rho 
#> 0.1218

Listing / Output 11.31: Society contribution satisfaction by job type for females

Report 11.7: Society contribution satisfaction by job type for females

The satisfaction with society contribution was significantly associated with job type and also with age. Specifically, there was a statistically significant relationship between job type and satisfaction with society contribution $\chi^2_{KW}(2)10.52, p< 0.01$. In addition, there was also a significant and a weak positive correlation between age and society contribution satisfaction (rs = 0.12; p < .001).

11.8 Avchievement 5: Estimating an ordinal logistic regression model

For ordinal categorical variables, a model is needed that can account for the order of the outcome variable.

11.8.1 NHST Step 1

Write the null and alternate hypotheses:

Wording for H0 and HA

H0: Job type and age are not helpful in explaining satisfaction with salary for female college graduates.
HA: Job type and age are helpful in explaining satisfaction with salary for female college graduates.

11.8.2 NHST Step 2

Compute the test statistic.

The ordinal regression model is estimated using the polr() function from the {MASS} package (see: Section A.47).

R Code 11.32 : Compute the statistics for the ordinal logistic regression model

Code

# ordinal logistic regression for salary satisfaction
# model salary satisfaction based on job type and age
olm11.1 <- MASS::polr(
  formula = satis_salary ~ job_cat + age,
  data = tbl11.4)

save_data_file("chap11", olm11.1, "olm11.1.rds")

summary(object = olm11.1)

#> 
#> Re-fitting to get Hessian

#> Call:
#> MASS::polr(formula = satis_salary ~ job_cat + age, data = tbl11.4)
#> 
#> Coefficients:
#>                          Value Std. Error t value
#> job_catOther Sciences -0.68030   0.210633  -3.230
#> job_catNonscience     -0.73586   0.156240  -4.710
#> age                    0.01168   0.004934   2.367
#> 
#> Intercepts:
#>                                          Value   Std. Error t value
#> Very dissatisfied|Somewhat dissatisfied  -2.0899  0.2471    -8.4587
#> Somewhat dissatisfied|Somewhat satisfied -0.9366  0.2377    -3.9398
#> Somewhat satisfied|Very satisfied         1.0842  0.2384     4.5473
#> 
#> Residual Deviance: 2515.41 
#> AIC: 2527.41

Code

# estimate null model
olm11.0 <- MASS::polr(
  formula = satis_salary ~ 1,
  data = tbl11.4)

# job model chi-squared
salary_chisq <- olm11.0$deviance - olm11.1$deviance

# degrees of freedom for chi-squared
salary_df <- 
  base::length(x = olm11.1$coefficients) - 
  base::length(x = olm11.0$coefficients)

# pvalue for chi-squared
salary_p <- stats::pchisq(
  q = salary_chisq, 
  df = salary_df, 
  lower.tail = FALSE
  )

# put together and print
modelsig_salary <- 
  base::round(x = c(salary_chisq, salary_df, salary_p), 3)

base::names(x = modelsig_salary) <- c("Chi-squared", "d.f.", "p")

modelsig_salary

#> Chi-squared        d.f.           p 
#>      26.785       3.000       0.000

Listing / Output 11.32: Computing the ordinal logistic regression model

Similar to the multinomial model, the model summary includes the residual deviance but not the null deviance to compute the model chi-squared. Following the strategy used for the multinomial model, we had to compute the null deviance by estimating a model with no predictors and computing the chi-squared using the deviance values from the two models.

The model chi-squared is 26.79 with 3 degrees of freedom.

11.8.3 NHST Step 3

Review and interpret the test statistics: Calculate the probability that your test statistic is at least as big as it is if there is no relationship (i.e., the null is true).

The p-value for $χ^2(3) = 26.79$ is less than .001.

11.8.4 NHST Step 4

Conclude and write report.

We reject the null hypothesis. Job type and age statistically significantly help to predict satisfaction with salary for female college graduates [$χ2(3) = 26.79; p < .001$].

11.8.5 Ordinal logistic regression model fit

Example 11.10 : Numbered Example Title

fit model
odds & confint

R Code 11.33 : Ordinal logistic regression model fit

Code

# observed vs. predicted category for each observation
fit_salary_abs <- table(
  observed = olm11.1$model$satis_salary,
  predicted = stats::predict(object = olm11.1)
  )
fit_salary_abs

#>                        predicted
#> observed                Very dissatisfied Somewhat dissatisfied
#>   Very dissatisfied                     0                     0
#>   Somewhat dissatisfied                 0                     0
#>   Somewhat satisfied                    0                     0
#>   Very satisfied                        0                     0
#>                        predicted
#> observed                Somewhat satisfied Very satisfied
#>   Very dissatisfied                    126              0
#>   Somewhat dissatisfied                184              0
#>   Somewhat satisfied                   454              1
#>   Very satisfied                       234              1

Code

# observed vs. predicted category for each observation
fit_salary_perc <- prop.table(x = table(
    observed = olm11.1$model$satis_salary,
    predicted = predict(object = olm11.1)
              ),
  margin = 1
  )
fit_salary_perc

#>                        predicted
#> observed                Very dissatisfied Somewhat dissatisfied
#>   Very dissatisfied           0.000000000           0.000000000
#>   Somewhat dissatisfied       0.000000000           0.000000000
#>   Somewhat satisfied          0.000000000           0.000000000
#>   Very satisfied              0.000000000           0.000000000
#>                        predicted
#> observed                Somewhat satisfied Very satisfied
#>   Very dissatisfied            1.000000000    0.000000000
#>   Somewhat dissatisfied        1.000000000    0.000000000
#>   Somewhat satisfied           0.997802198    0.002197802
#>   Very satisfied               0.995744681    0.004255319

Listing / Output 11.33: Ordinal logistic regression model fit

Warning 11.2: Predicted values wrong!

With the exception of two values all predicted values are “Somewhat satisfied”. That seems strange! But the result is that the model predicted 455 (45,5%) of the observations in contrast to 267 (26.7%). That is much better than the figures from the book.

But I still think that there may be something wrong in my computation.

R Code 11.34 : Compute odds ratios and confidence intervals

Code

# odds ratios and confidence intervals
or_sal <- data.frame(OR = exp(x = olm11.1$coefficients),
                     CI = exp(x = confint(object = olm11.1)))

#> Waiting for profiling to be done...

#> 
#> Re-fitting to get Hessian

Code

or_sal

#>                              OR  CI.2.5.. CI.97.5..
#> job_catOther Sciences 0.5064636 0.3348739 0.7649294
#> job_catNonscience     0.4790949 0.3522406 0.6500583
#> age                   1.0117454 1.0020218 1.0215952

Listing / Output 11.34: Odds ratios and confidence intervals

Besides that I get different numbers because of different case distribution from the sampling procedures, the values here are — in contrast to the predict values — plausible.

For ordinal models, the odds ratios are called proportional odds ratios and are the increases or decreases in odds of being in a higher group or groups versus all the lower groups for each one-unit increase in the predictor.

As an example follows the other sciences predictor:

Compared to females with computer science, math, or engineering jobs, females with non-science jobs have 52% lower odds of being very satisfied compared to the other satisfaction categories (OR = 0.48; 95% CI: 0.35–0.65).
Compared to females with computer science, math, or engineering jobs, females with non-science jobs have 52% lower odds of being very or somewhat satisfied compared to the other satisfaction categories (OR = 0.48; 95% CI: 0.35–0.65).
Compared to females with computer science, math, or engineering jobs, females with non-science jobs have 52% lower odds of being very or somewhat satisfied or somewhat dissatisfied compared to very dissatisfied (OR = 0.48; 95% CI: 0.35–0.65).

11.9 Achievement 6: Checking assumptions

Besides the assumption of independent observations that was already several times met, the proportional odds is another assumption for ordered logistic regression models.

The proportional odds assumption requires that the odds for every shift from category to category of the outcome are the same for each level of the predictors. For example, say the odds of someone being in the very dissatisfied group compared to the somewhat dissatisfied group were 2.5 times higher in the non-science group compared to the computer science, math, or engineering group. The proportional odds assumption would require that the odds of someone being in the somewhat dissatisfied group compared to the somewhat satisfied group would also need to be 2.5 times higher in the non-science compared to computer science, math, and engineering group.

In the {ordinal} package, the nominal_test() function tests the proportional odds assumption. The null hypothesis for this is that the proportional odds assumption is met; the alternate hypothesis is that the proportional odds assumption is not met. The assumption is checked for each predictor in the model.

Unfortunately, nominal_test() does not like polr objects, so the model has to be re-estimated using clm() from the ordinal package.

R Code 11.35 : Checking proportional odds assumption for an ordered logistic regression model

Code

# check proportional odds assumption

ordinal::nominal_test(
  object = ordinal::clm(
    formula = satis_salary ~ job_cat + age,
    data = tbl11.4
    )
  )

#> Tests of nominal effects
#> 
#> formula: satis_salary ~ job_cat + age
#>         Df  logLik    AIC    LRT Pr(>Chi)
#> <none>     -1257.7 2527.4                
#> job_cat  4 -1254.5 2528.9 6.5173   0.1637
#> age      2 -1257.6 2531.3 0.1408   0.9320

Listing / Output 11.35: Test proportional odds assumption with ordinal::nominal_test()

The p-values are .16 and .93, so the null hypothesis is retained for the job type predictor and for the age predictor. The proportional odds assumption is met for this model.

11.9.1 Summary

Report 11.8

An ordinal logistic regression using job type and age was better at predicting satisfaction with salary among female college graduates than a baseline model including no predictors [χ2(3) = 28.89; p < .001]. - The model met its assumptions and correctly predicted 455 of the 1,000 (45,5%) observations. - Female college graduates in non-science careers have 49% lower odds than those in computer science, math, or engineering of being very satisfied with their salary compared to being in another satisfaction group. - Likewise, female college graduates in other sciences had 52% lower odds than those in computer science, math, or engineering of being very satisfied with their salary compared to being in another satisfaction group. - There was no significant relationship between age and salary satisfaction.

11.10 Exercises (empty)

11.11 Glossary

term	definition
Chi-squared	Chi-squared is the test statistic following the chi-squared probability distribution; the chi-squared test statistic is used in inferential tests, including examining the association between two categorical variables and determining statistical significance for a logistic regression model. (SwR, Glossary)
Hausman-McFadden	The Hausman-McFadden test is a consistency test for multinomial logit models. If the independance of irrelevant alternatives applies, the probability ratio of every two alternatives depends only on the characteristics of these alternatives. (help page hfmtest() in the {mlogit} package)
IIA	Independence of irrelevant alternatives (IIA) is an assumption of multinomial logistic regression that requires the categories of the outcome to be independent of one another; for a three-category outcome variable with Categories A, B, and C, the probability of being in Category A compared to Category B cannot change because Category C exists. (SwR, Glossary)
Kruskal-Wallis	Kruskal-Wallis test is used to compare ranks across three or more groups when the normal distribution assumption fails for analysis of variance (ANOVA) (SwR, Glossary)
Null Hypothesis	The null hypothesis (H0, or simply the Null) is a statement of no difference or no association that is used to guide statistical inference testing (SwR, Glossary)
p-value	The p-value is the probability that the test statistic is at least as big as it is under the null hypothesis (SwR, Glossary)
Proportional-Odds-Ratios	Proportional odds ratios are odds ratios resulting from ordinal regression that represent the odds of being in a higher group or groups compared to being in all the lower groups with each one-unit increase in the predictor. (SwR, Glossary)
SwR	SwR is my abbreviation of: Harris, J. K. (2020). Statistics With R: Solving Problems Using Real-World Data (Illustrated Edition). SAGE Publications, Inc.

Session Info

Code

sessioninfo::session_info()

#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31)
#>  os       macOS Sequoia 15.1
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2024-11-13
#>  pandoc   3.5 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version     date (UTC) lib source
#>  base64enc      0.1-3       2015-07-28 [2] CRAN (R 4.4.0)
#>  class          7.3-22      2023-05-03 [2] CRAN (R 4.4.2)
#>  cli            3.6.3       2024-06-21 [2] CRAN (R 4.4.1)
#>  codetools      0.2-20      2024-03-31 [2] CRAN (R 4.4.2)
#>  colorspace     2.1-1       2024-07-26 [2] CRAN (R 4.4.0)
#>  commonmark     1.9.2       2024-10-04 [2] CRAN (R 4.4.1)
#>  crayon         1.5.3       2024-06-20 [2] CRAN (R 4.4.0)
#>  curl           6.0.0       2024-11-05 [2] CRAN (R 4.4.1)
#>  DBI            1.2.3       2024-06-02 [2] CRAN (R 4.4.0)
#>  dfidx          0.1-0       2024-08-22 [2] CRAN (R 4.4.1)
#>  digest         0.6.37      2024-08-19 [2] CRAN (R 4.4.1)
#>  dplyr          1.1.4       2023-11-17 [2] CRAN (R 4.4.0)
#>  e1071          1.7-16      2024-09-16 [2] CRAN (R 4.4.1)
#>  evaluate       1.0.1       2024-10-10 [2] CRAN (R 4.4.1)
#>  fansi          1.0.6       2023-12-08 [2] CRAN (R 4.4.0)
#>  farver         2.1.2       2024-05-13 [2] CRAN (R 4.4.0)
#>  fastmap        1.2.0       2024-05-15 [2] CRAN (R 4.4.0)
#>  forcats        1.0.0       2023-01-29 [2] CRAN (R 4.4.0)
#>  Formula        1.2-5       2023-02-24 [2] CRAN (R 4.4.0)
#>  generics       0.1.3       2022-07-05 [2] CRAN (R 4.4.0)
#>  ggplot2        3.5.1       2024-04-23 [2] CRAN (R 4.4.0)
#>  glossary     * 1.0.0.9003  2024-08-05 [2] Github (debruine/glossary@05e4a61)
#>  glue           1.8.0       2024-09-30 [2] CRAN (R 4.4.1)
#>  gtable         0.3.6       2024-10-25 [2] CRAN (R 4.4.1)
#>  haven          2.5.4       2023-11-30 [2] CRAN (R 4.4.0)
#>  here           1.0.1       2020-12-13 [2] CRAN (R 4.4.0)
#>  hms            1.1.3       2023-03-21 [2] CRAN (R 4.4.0)
#>  htmltools      0.5.8.1     2024-04-04 [2] CRAN (R 4.4.0)
#>  htmlwidgets    1.6.4       2023-12-06 [2] CRAN (R 4.4.0)
#>  jsonlite       1.8.9       2024-09-20 [2] CRAN (R 4.4.1)
#>  kableExtra     1.4.0       2024-01-24 [2] CRAN (R 4.4.0)
#>  knitr          1.49        2024-11-08 [2] CRAN (R 4.4.1)
#>  labeling       0.4.3       2023-08-29 [2] CRAN (R 4.4.0)
#>  labelled       2.13.0      2024-04-23 [2] CRAN (R 4.4.0)
#>  lattice        0.22-6      2024-03-20 [2] CRAN (R 4.4.2)
#>  lifecycle      1.0.4       2023-11-07 [2] CRAN (R 4.4.0)
#>  lmtest         0.9-40      2022-03-21 [2] CRAN (R 4.4.0)
#>  magrittr       2.0.3       2022-03-30 [2] CRAN (R 4.4.0)
#>  markdown       1.13        2024-06-04 [2] CRAN (R 4.4.0)
#>  MASS           7.3-61      2024-06-13 [2] CRAN (R 4.4.2)
#>  Matrix         1.7-1       2024-10-18 [2] CRAN (R 4.4.2)
#>  mitools        2.4         2019-04-26 [2] CRAN (R 4.4.0)
#>  mlogit         1.1-1       2020-10-02 [2] CRAN (R 4.4.0)
#>  munsell        0.5.1       2024-04-01 [2] CRAN (R 4.4.0)
#>  nlme           3.1-166     2024-08-14 [2] CRAN (R 4.4.2)
#>  nnet           7.3-19      2023-05-03 [2] CRAN (R 4.4.2)
#>  numDeriv       2016.8-1.1  2019-06-06 [2] CRAN (R 4.4.0)
#>  ordinal        2023.12-4.1 2024-08-19 [2] CRAN (R 4.4.1)
#>  pillar         1.9.0       2023-03-22 [2] CRAN (R 4.4.0)
#>  pkgconfig      2.0.3       2019-09-22 [2] CRAN (R 4.4.0)
#>  proxy          0.4-27      2022-06-09 [2] CRAN (R 4.4.0)
#>  purrr          1.0.2       2023-08-10 [2] CRAN (R 4.4.0)
#>  R6             2.5.1       2021-08-19 [2] CRAN (R 4.4.0)
#>  rbibutils      2.3         2024-10-04 [2] CRAN (R 4.4.1)
#>  RColorBrewer   1.1-3       2022-04-03 [2] CRAN (R 4.4.0)
#>  Rcpp           1.0.13-1    2024-11-02 [1] CRAN (R 4.4.1)
#>  Rdpack         2.6.1       2024-08-06 [2] CRAN (R 4.4.1)
#>  repr           1.1.7       2024-03-22 [2] CRAN (R 4.4.0)
#>  rlang          1.1.4       2024-06-04 [2] CRAN (R 4.4.0)
#>  rmarkdown      2.29        2024-11-04 [2] CRAN (R 4.4.1)
#>  rprojroot      2.0.4       2023-11-05 [2] CRAN (R 4.4.0)
#>  rstudioapi     0.17.1      2024-10-22 [2] CRAN (R 4.4.1)
#>  rversions      2.1.2       2022-08-31 [2] CRAN (R 4.4.0)
#>  scales         1.3.0       2023-11-28 [2] CRAN (R 4.4.0)
#>  sessioninfo    1.2.2       2021-12-06 [2] CRAN (R 4.4.0)
#>  skimr          2.1.5       2022-12-23 [2] CRAN (R 4.4.0)
#>  statmod        1.5.0       2023-01-06 [2] CRAN (R 4.4.0)
#>  stringi        1.8.4       2024-05-06 [2] CRAN (R 4.4.0)
#>  stringr        1.5.1       2023-11-14 [2] CRAN (R 4.4.0)
#>  survey         4.4-2       2024-03-20 [2] CRAN (R 4.4.0)
#>  survival       3.7-0       2024-06-05 [2] CRAN (R 4.4.2)
#>  svglite        2.1.3       2023-12-08 [2] CRAN (R 4.4.0)
#>  systemfonts    1.1.0       2024-05-15 [2] CRAN (R 4.4.0)
#>  tableone       0.13.2      2022-04-15 [2] CRAN (R 4.4.0)
#>  tibble         3.2.1       2023-03-20 [2] CRAN (R 4.4.0)
#>  tidyr          1.3.1       2024-01-24 [2] CRAN (R 4.4.0)
#>  tidyselect     1.2.1       2024-03-11 [2] CRAN (R 4.4.0)
#>  ucminf         1.2.2       2024-06-24 [2] CRAN (R 4.4.0)
#>  utf8           1.2.4       2023-10-22 [2] CRAN (R 4.4.0)
#>  vctrs          0.6.5       2023-12-01 [2] CRAN (R 4.4.0)
#>  viridisLite    0.4.2       2023-05-02 [2] CRAN (R 4.4.0)
#>  withr          3.0.2       2024-10-28 [2] CRAN (R 4.4.1)
#>  xfun           0.49        2024-10-31 [2] CRAN (R 4.4.1)
#>  xml2           1.3.6       2023-12-04 [2] CRAN (R 4.4.0)
#>  yaml           2.3.10      2024-07-26 [2] CRAN (R 4.4.0)
#>  zoo            1.8-12      2023-04-13 [2] CRAN (R 4.4.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.4-x86_64/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────