Chapter 10 Poisson Regression
10.1 Objectives
After completing this chapter, the readers are expected to
- understand the basic concepts behind Poisson regression for count and rate data
- perform Poisson regression for count and rate
- perform model fit assessment
- present and interpret the results of Poisson regression analyses
10.2 Introduction
Poisson regression is a regression analysis for count and rate data. As mentioned in Chapter 7, it is a type of generalized linear model (GLM) used whenever the outcome is a count; it also accommodates rate data, as we will see shortly. Although count and rate data are very common in medicine and the health sciences, in our experience Poisson regression is underutilized in medical research. Most often, researchers end up using linear regression because they are more familiar with it and have had less exposure to the advantages of using Poisson regression to handle count and rate data.
A count is discrete numerical data. Although it is convenient to handle a count outcome with linear regression by treating the count (e.g. the number of hospital admissions) as if it were continuous numerical data (e.g. systolic blood pressure in mmHg), doing so may produce illogical predicted values. For example, using linear regression to predict the number of asthmatic attacks in the past year, we may end up with a negative number of attacks, which makes no clinical sense! So, it is recommended that medical researchers become familiar with Poisson regression and use it whenever the outcome variable is a count.
Another reason for using Poisson regression arises when the number of cases (e.g. deaths, accidents) is small relative to the number of non-events (e.g. alive, no accident). In that situation it makes more sense to collect information only on the cases in a population of interest, instead of also collecting information on the non-cases as in typical cohort and case-control studies. For example, in publicly available COVID-19 data, only the number of deaths was reported, along with some basic sociodemographic and clinical information for the cases. Whenever information on the non-cases is available, it is quite easy to use logistic regression for the analysis instead.
Basically, Poisson regression models the linear relationship between:
- outcome: count variable (e.g. the number of hospital admissions, parity, cancerous lesions, asthmatic attacks). This is transformed into the natural log scale.
- predictors/independent variables: numerical variables (e.g. age, blood pressure, income) and categorical variables (e.g. gender, race, education level).
We might be interested in knowing the relationship between the number of asthmatic attacks in the past one year with sociodemographic factors. This relationship can be explored by a Poisson regression analysis.
Basically, for Poisson regression, the relationship between the outcome and predictors is as follows,
\[\begin{aligned} natural\ log\ of\ count\ outcome = &\ numerical\ predictors \\ & + categorical\ predictors \end{aligned}\]
At times, the count is proportional to a denominator. For example, given the same number of deaths, the death rate in a small population will be higher than the rate in a large population. If we were to compare the number of deaths between the populations, it would not be a fair comparison. Thus, we may consider adding denominators to the Poisson regression model in the form of offsets. The denominator could also be the unit time of exposure, for example person-years of cigarette smoking. This will be explained later in the section on Poisson regression for rate.
In the previous chapter, we learned that logistic regression allows us to obtain the odds ratio, which is approximately the relative risk given a predictor. For Poisson regression, by taking the exponent of the coefficient, we obtain the rate ratio RR (also known as incidence rate ratio IRR),
\[RR=exp(b_{p})\] for the coefficient \(b_p\) of the p-th predictor. This is interpreted in a similar way to the odds ratio for logistic regression, i.e. as approximately the relative risk given a predictor.
10.3 Prepare R Environment for Analysis
10.3.1 Libraries
For this chapter, we will be using the following packages:
- tidyverse: a general and powerful package for data transformation
- psych: for descriptive statistics
- gtsummary: for coming up with nice tables for results
- broom: for tidying up the results
- epiDisplay: an epidemiological data display package
These are loaded as follows using the function library(),
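A minimal sketch of this step, assuming the four packages above are already installed:
library(tidyverse)
library(psych)
library(gtsummary)
library(broom)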
For epiDisplay, we will use the package directly using epiDisplay::function_name() instead. We did not load the package as we usually do with library(epiDisplay) because it has some conflicts with the packages we loaded above.
10.4 Poisson regression for Count
10.4.1 About Poisson regression for count
Poisson regression models the linear relationship between:
- outcome: count variable (e.g. the number of hospital admissions, parity, cancerous lesions, asthmatic attacks). This is transformed into the natural log scale.
- predictor(s): numerical variables (e.g. age, blood pressure, income) and categorical variables (e.g. gender, race, education level).
We might be interested in knowing the relationship between the number of asthmatic attacks in the past one year with sociodemographic factors. This relationship can be explored by a Poisson regression analysis.
Multiple Poisson regression for count is given as,
\[\begin{aligned} ln(count\ outcome) = &\ intercept \\ & + coefficients \times numerical\ predictors \\ & + coefficients \times categorical\ predictors \end{aligned}\]
or in a shorter form,
\[ln(\hat y) = b_0 + b_1x_1 + b_2x_2 + ... + b_px_p\] where we have p predictors.
10.4.2 Dataset
The data on the number of asthmatic attacks per year among a sample of 120 patients and the associated factors are given in asthma.csv.
The dataset contains four variables:
- gender: Gender of the subjects (categorical) {male, female}.
- res_inf: Recurrent respiratory infection (categorical) {no, yes}.
- ghq12: General Health Questionnaire 12 (GHQ-12) score of psychological well-being (numerical) {0 to 36}.
- attack: Number of asthmatic attacks per year (count).
The dataset is loaded as follows,
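A sketch of this step, assuming asthma.csv is in the working directory; the object name asthma matches the code used later in this chapter:
asthma = read.csv("asthma.csv")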
We then look at the basic structure of the dataset,
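For example, with str():
str(asthma)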
## 'data.frame': 120 obs. of 4 variables:
## $ gender : chr "female" "male" "male" "female" ...
## $ res_inf: chr "yes" "no" "yes" "yes" ...
## $ ghq12 : int 21 17 30 22 27 33 24 23 25 28 ...
## $ attack : int 6 4 8 5 2 3 2 1 2 2 ...
10.4.3 Data exploration
For descriptive statistics, we introduce the epiDisplay package. It is a nice package that allows us to easily obtain statistics for both numerical and categorical variables at the same time. We use the codebook() function from the package.
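A sketch of the call, consistent with the output below:
epiDisplay::codebook(asthma)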
##
##
##
## gender :
## A character vector
##
## ==================
## res_inf :
## A character vector
##
## ==================
## ghq12 :
## obs. mean median s.d. min. max.
## 120 16.342 19 9.81 0 33
##
## ==================
## attack :
## obs. mean median s.d. min. max.
## 120 2.458 2 2.012 0 9
##
## ==================
10.4.4 Univariable analysis
For the univariable analysis, we fit univariable Poisson regression models for gender (gender), recurrent respiratory infection (res_inf) and GHQ-12 (ghq12).
# gender
pois_attack1 = glm(attack ~ gender, data = asthma, family = "poisson")
summary(pois_attack1)
##
## Call:
## glm(formula = attack ~ gender, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.02105 0.07332 13.925 <2e-16 ***
## gendermale -0.30000 0.12063 -2.487 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 223.23 on 118 degrees of freedom
## AIC: 500.3
##
## Number of Fisher Scoring iterations: 5
# res_inf
pois_attack2 = glm(attack ~ res_inf, data = asthma, family = "poisson")
summary(pois_attack2)
##
## Call:
## glm(formula = attack ~ res_inf, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2877 0.1213 2.372 0.0177 *
## res_infyes 0.9032 0.1382 6.533 6.44e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 180.49 on 118 degrees of freedom
## AIC: 457.56
##
## Number of Fisher Scoring iterations: 5
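For ghq12, the corresponding chunk (a sketch consistent with the Call shown in the output below; the object name pois_attack3 matches the AIC comparison later) would be:
# ghq12
pois_attack3 = glm(attack ~ ghq12, data = asthma, family = "poisson")
summary(pois_attack3)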
##
## Call:
## glm(formula = attack ~ ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.230923 0.159128 -1.451 0.147
## ghq12 0.059500 0.006919 8.599 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 145.13 on 118 degrees of freedom
## AIC: 422.2
##
## Number of Fisher Scoring iterations: 5
From the outputs, all variables are important with P-values < .25. These variables are candidates for inclusion in the multivariable analysis. As a reminder, however, in the context of confirmatory research the choice of variables to include must also be guided by expert judgement.
10.4.5 Multivariable analysis
For the multivariable analysis, we included all variables as predictors of attack. Here we use the dot . as a shortcut for all variables when specifying the right-hand side of the formula of the glm.
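A sketch of the fit, consistent with the Call in the output below; the object name pois_attack_all is only illustrative:
pois_attack_all = glm(attack ~ ., data = asthma, family = "poisson")
summary(pois_attack_all)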
##
## Call:
## glm(formula = attack ~ ., family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.315387 0.183500 -1.719 0.08566 .
## gendermale -0.041905 0.122469 -0.342 0.73222
## res_infyes 0.426431 0.152859 2.790 0.00528 **
## ghq12 0.049508 0.007878 6.285 3.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 136.68 on 116 degrees of freedom
## AIC: 417.75
##
## Number of Fisher Scoring iterations: 5
From the output, we noted that gender is not significant with P > 0.05, although it was significant in the univariable analysis. Now, we fit a model excluding gender,
# minus gender
pois_attack_all1 = glm(attack ~ res_inf + ghq12, data = asthma,
family = "poisson")
summary(pois_attack_all1)
##
## Call:
## glm(formula = attack ~ res_inf + ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.34051 0.16823 -2.024 0.04296 *
## res_infyes 0.42816 0.15282 2.802 0.00508 **
## ghq12 0.04989 0.00779 6.404 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 136.80 on 117 degrees of freedom
## AIC: 415.86
##
## Number of Fisher Scoring iterations: 5
From the output, both variables are significant predictors of asthmatic attack (or, more accurately, of the natural log of the count of asthmatic attacks). This serves as our preliminary model.
10.4.6 Interaction
Now, we include a two-way interaction term between res_inf and ghq12.
pois_attack_allx = glm(attack ~ res_inf * ghq12, data = asthma,
family = "poisson")
summary(pois_attack_allx)
##
## Call:
## glm(formula = attack ~ res_inf * ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.63436 0.23408 -2.710 0.00673 **
## res_infyes 1.01927 0.32822 3.105 0.00190 **
## ghq12 0.06834 0.01186 5.763 8.29e-09 ***
## res_infyes:ghq12 -0.03135 0.01531 -2.048 0.04056 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 132.67 on 116 degrees of freedom
## AIC: 413.74
##
## Number of Fisher Scoring iterations: 5
It turns out that the interaction term res_inf * ghq12 is significant. We may include this interaction term in the final model. However, this might complicate our interpretation of the result as we can no longer interpret individual coefficients. Given that the P-value of the interaction term is close to the commonly used significance level of 0.05, we may choose to ignore this interaction. But keep in mind that the decision is yours, the analyst. Having said that, if the purpose of modelling is mainly for prediction, the issue is less severe because we are more concerned with the predicted values than with the clinical interpretation of the result. However, if you insist on including the interaction, it can be done by writing down the equation for the model, substituting the value of res_inf with yes = 1 or no = 0, and obtaining the coefficient for ghq12. We will see how to do this under Presentation and interpretation below.
10.4.7 Model fit assessment
For Poisson regression, we assess the model fit by chi-square goodness-of-fit test, model-to-model AIC comparison and scaled Pearson chi-square statistic. We also assess the regression diagnostics using standardized residuals.
Chi-square goodness-of-fit
The chi-square goodness-of-fit test can be performed using the poisgof() function in the epiDisplay package. Note that, instead of using the Pearson chi-square statistic, it utilizes the residual deviance with its respective degrees of freedom (df) (e.g. from the output of summary(pois_attack_all1) above). A P-value > 0.05 indicates good model fit.
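A sketch of the call on our preliminary model:
epiDisplay::poisgof(pois_attack_all1)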
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 136.7964
##
## $df
## [1] 117
##
## $p.value
## [1] 0.101934
Model-to-model AIC comparison
We may also compare the models that we have fitted so far by the Akaike information criterion (AIC). Recall that R uses AIC for stepwise automatic variable selection, as explained in the Linear Regression chapter.
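A sketch of the comparison, using the model objects fitted so far:
AIC(pois_attack1, pois_attack2, pois_attack3, pois_attack_all1, pois_attack_allx)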
## df AIC
## pois_attack1 2 500.3009
## pois_attack2 2 457.5555
## pois_attack3 2 422.1997
## pois_attack_all1 3 415.8649
## pois_attack_allx 4 413.7424
The best model is the one with the lowest AIC, which is the model with the interaction term. However, since the AIC of the model with the interaction term differs only slightly from that of the model without it, we may instead choose the simpler model without the interaction term.
Scaled Pearson chi-square statistic
The Pearson chi-square statistic divided by its df gives the scaled Pearson chi-square statistic (Fleiss, Levin, and Paik 2003). The closer the value of this statistic is to 1, the better the model fit. First, the Pearson chi-square statistic is calculated as,
\[\chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat y_i)^2}{\hat y_i}\] easily obtained in R as below.
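A sketch of the calculation, using the observed counts and the fitted values from pois_attack_all1:
x2 = sum((asthma$attack - pois_attack_all1$fitted)^2 / pois_attack_all1$fitted)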
Then we obtain the scaled Pearson chi-square statistic \(\chi^2_P / df\), where \(df = n - p\). Here, \(n\) is the number of observations, nrow(asthma), and \(p\) is the number of coefficients/parameters estimated for the model, length(pois_attack_all1$coefficients).
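Continuing the sketch, with the result saved as sx2 (the name referred to below):
sx2 = x2 / (nrow(asthma) - length(pois_attack_all1$coefficients))
sx2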
## [1] 1.052376
The value of sx2 is 1.052, which is close to 1. This indicates good model fit.
It is actually easier to obtain the scaled Pearson chi-square statistic by changing family = "poisson" to family = "quasipoisson" in the glm specification, then viewing the dispersion value from the summary of the model. This works because the scaled Pearson chi-square statistic is an estimator of the overdispersion parameter in a quasi-Poisson regression model (Fleiss, Levin, and Paik 2003). We will discuss quasi-Poisson regression towards the end of this chapter.
qpois_attack_all1_summ = summary(glm(attack ~ res_inf + ghq12,
data = asthma, family = "quasipoisson"))
qpois_attack_all1_summ$dispersion
## [1] 1.052383
Regression diagnostics
Here, we use standardized residuals obtained with the rstandard() function. Because they are in the form of standardized z scores, we may use specific cutoffs to find outliers, for example 1.96 (for \(\alpha\) = 0.05) or 3.89 (for \(\alpha\) = 0.0001).
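A sketch of this step; std_res is the name used in the code further below:
std_res = rstandard(pois_attack_all1)
std_res[abs(std_res) > 1.96]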
## 28 38 95 103
## -2.082287 2.168737 1.977013 2.461419
We now locate where the discrepancies are,
index = names(std_res[abs(std_res) > 1.96])
cbind(asthma[index,], attack_pred = pois_attack_all1$fitted[index])
## gender res_inf ghq12 attack attack_pred
## 28 female yes 29 1 4.6384274
## 38 female yes 26 9 3.9936854
## 95 male no 1 3 0.7477996
## 103 female no 19 6 1.8355524
10.4.8 Presentation and interpretation
Model without interaction
After all these assumption check points, we decide on the final model and rename the model for easier reference.
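A sketch of the renaming step, using pois_attack_final (the name used in the prediction code later):
pois_attack_final = pois_attack_all1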
We use tbl_regression() to come up with a table for the results. Here, for interpretation, we exponentiate the coefficients to obtain the incidence rate ratio, IRR.
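A sketch of the call; exponentiate = TRUE returns IRRs instead of log coefficients:
tbl_regression(pois_attack_final, exponentiate = TRUE)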
| Characteristic | IRR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| res_inf | | | |
| no | — | — | |
| yes | 1.53 | 1.14, 2.08 | 0.005 |
| ghq12 | 1.05 | 1.04, 1.07 | <0.001 |

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
Based on this table, we may interpret the results as follows:
- Those with recurrent respiratory infection are at higher risk of having an asthmatic attack with an IRR of 1.53 (95% CI: 1.14, 2.08), while controlling for the effect of GHQ-12 score.
- An increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.05 (95% CI: 1.04, 1.07), while controlling for the effect of recurrent respiratory infection.
We can also view and save the output in a format suitable for exporting to a spreadsheet for later use. We use the tidy() function for the job,
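A sketch of the call that produces the exponentiated, tidied output below:
tidy(pois_attack_final, exponentiate = TRUE, conf.int = TRUE)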
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.711 0.168 -2.02 4.30e- 2 0.505 0.978
## 2 res_infyes 1.53 0.153 2.80 5.08e- 3 1.14 2.08
## 3 ghq12 1.05 0.00779 6.40 1.51e-10 1.04 1.07
and export it to a .csv file,
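for example as sketched here, where the object and file names are only illustrative:
tib_pois_attack = tidy(pois_attack_final, exponentiate = TRUE, conf.int = TRUE)
write.csv(tib_pois_attack, "pois_attack_final.csv")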
Then, we display the coefficients (i.e. without the exponent) and transfer the values into an equation,
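A sketch of the call that produces the rounded coefficient table below:
round(summary(pois_attack_final)$coefficients, 2)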
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.34 0.17 -2.02 0.04
## res_infyes 0.43 0.15 2.80 0.01
## ghq12 0.05 0.01 6.40 0.00
\[\begin{aligned} ln(attack) = & -0.34 + 0.43\times res\_inf + 0.05\times ghq12 \end{aligned}\]
From the table and equation above, the effect of a one-mark increase in GHQ-12 score might not be clinically interesting. Let's say, as clinicians, we want to know the effect of an increase in GHQ-12 score by six marks instead, which is 1/6 of the maximum score of 36. From the coefficient for GHQ-12 of 0.05, the risk is calculated as
\[IRR_{GHQ12\ by\ 6} = exp(0.05\times 6) = 1.35\]
Or we may fit the model again with some adjustment to the data and the glm specification. First, we divide the ghq12 values by 6 and save them into a new variable ghq12_by6, followed by fitting the model again using the edited data set and the new variable,
# Divide ghq12 by 6
asthma_new = asthma %>% mutate(ghq12_by6 = ghq12 / 6)
# Fit the model
pois_attack_final_by6 = glm(attack ~ res_inf + ghq12_by6,
data = asthma_new, family = "poisson")
Now we view the results for the re-fitted model,
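A sketch of the calls, first on the log scale and then exponentiated, matching the two outputs below:
tidy(pois_attack_final_by6, conf.int = TRUE)
tidy(pois_attack_final_by6, exponentiate = TRUE, conf.int = TRUE)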
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.341 0.168 -2.02 4.30e- 2 -0.682 -0.0222
## 2 res_infyes 0.428 0.153 2.80 5.08e- 3 0.135 0.734
## 3 ghq12_by6 0.299 0.0467 6.40 1.51e-10 0.209 0.392
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.711 0.168 -2.02 4.30e- 2 0.505 0.978
## 2 res_infyes 1.53 0.153 2.80 5.08e- 3 1.14 2.08
## 3 ghq12_by6 1.35 0.0467 6.40 1.51e-10 1.23 1.48
As compared to the first method, which requires multiplying the coefficient manually, the second method is preferable in R as we also get the 95% CI for ghq12_by6.
Model with interaction
Now we will go through the interpretation of the model with interaction. We display the coefficients for the model with interaction (pois_attack_allx) and enter the values into an equation,
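A sketch of the call that produces the rounded coefficients below:
round(summary(pois_attack_allx)$coefficients, 2)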
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.63 0.23 -2.71 0.01
## res_infyes 1.02 0.33 3.11 0.00
## ghq12 0.07 0.01 5.76 0.00
## res_infyes:ghq12 -0.03 0.02 -2.05 0.04
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \end{aligned}\]
As we need to interpret the coefficient for ghq12 by the status of res_inf, we write an equation for each res_inf status. When res_inf = 1 (yes),
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \\ = & -0.63 + 1.02\times 1 + 0.07\times ghq12 -0.03\times 1\times ghq12 \\ = &\ 0.39 + 0.04\times ghq12 \end{aligned}\]
and when res_inf = 0 (no),
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \\ = & -0.63 + 1.02\times 0 + 0.07\times ghq12 -0.03\times 0\times ghq12 \\ = & -0.63 + 0.07\times ghq12 \end{aligned}\]
Now, based on the equations, we may interpret the results as follows:
- For those with recurrent respiratory infection, an increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.04 (IRR = exp[0.04]).
- For those without recurrent respiratory infection, an increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.07 (IRR = exp[0.07]).
Based on these IRRs, the effect of an increase in GHQ-12 score is slightly higher for those without recurrent respiratory infection. However, both are close to the IRR for a one-mark increase in GHQ-12 score from the model without interaction, IRR = exp(0.05) = 1.05. Since the differences in the IRR values for GHQ-12 between the models are minimal, the simpler Poisson regression model without the interaction is preferable in this case. But now you get the idea of how to interpret a model with an interaction term.
10.4.9 Prediction
We can use the final model above for prediction. Relevant to our data set, we may want to know the expected number of asthmatic attacks per year for a patient with recurrent respiratory infection and GHQ-12 score of 8,
pred = predict(pois_attack_final, list(res_inf = "yes", ghq12 = 8),
type = "response")
round(pred, 1)
## 1
## 1.6
Now, let’s say we want to know the expected number of asthmatic attacks per year for those with and without recurrent respiratory infection for each 12-mark increase in GHQ-12 score.
new_data = data.frame(res_inf = rep(c("yes", "no"), each = 4),
ghq12 = rep(c(0, 12, 24, 36), 2))
new_data$attack_pred = round(predict(pois_attack_final, new_data,
type = "response"), 1)
new_data
## res_inf ghq12 attack_pred
## 1 yes 0 1.1
## 2 yes 12 2.0
## 3 yes 24 3.6
## 4 yes 36 6.6
## 5 no 0 0.7
## 6 no 12 1.3
## 7 no 24 2.4
## 8 no 36 4.3
10.5 Poisson regression for Rate
10.5.1 About Poisson regression for rate
At times, the count is proportional to a denominator. For example, given the same number of deaths, the death rate in a small population will be higher than the rate in a large population. If we were to compare the number of deaths between the populations, it would not be a fair comparison. Thus, we may consider adding denominators to the Poisson regression model in the form of offsets. The denominator could also be the unit time of exposure, for example person-years of cigarette smoking.
As mentioned before, counts can be proportional to specific denominators, giving rise to rates. We may add the denominators to the Poisson regression model as offsets. Again, these denominators could be the stratum size or the unit time of exposure. Multiple Poisson regression for rate is specified by adding the offset in the form of the natural log of the denominator \(t\). This is given as,
\[ln(\hat y) = ln(t) + b_0 + b_1x_1 + b_2x_2 + ... + b_px_p\]
10.5.2 Dataset
The data on the number of lung cancer cases among doctors, cigarettes per day, years of smoking and the respective person-years at risk of lung cancer are given in smoke.csv. The person-years variable serves as the offset for our analysis. The original data came from Doll (1971) and were analyzed in the context of Poisson regression by Frome (1983) and Fleiss, Levin, and Paik (2003). The dataset contains four variables:
- smoke_yrs: Years of smoking (categorical) {15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59}.
- cigar_day: Cigarettes per day (numerical).
- person_yrs: Person-years at risk of lung cancer (numerical).
- case: Number of lung cancer cases (count).
The dataset is loaded as follows,
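A sketch of this step, assuming smoke.csv is in the working directory; the object name smoke matches the code used later:
smoke = read.csv("smoke.csv")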
Then, we look at its data structure,
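For example, with str():
str(smoke)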
## 'data.frame': 63 obs. of 4 variables:
## $ smoke_yrs : chr "15-19" "20-24" "25-29" "30-34" ...
## $ cigar_day : num 0 0 0 0 0 0 0 0 0 5.2 ...
## $ person_yrs: int 10366 8162 5969 4496 3512 2201 1421 1121 826 3121 ...
## $ case : int 1 0 0 0 0 0 0 0 2 0 ...
10.5.3 Data exploration
For descriptive statistics, we use epiDisplay::codebook() as before.
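A sketch of the call:
epiDisplay::codebook(smoke)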
##
##
##
## smoke_yrs :
## A character vector
##
## ==================
## cigar_day :
## obs. mean median s.d. min. max.
## 63 17.271 15.9 12.913 0 40.8
##
## ==================
## person_yrs :
## obs. mean median s.d. min. max.
## 63 2426.444 1826 2030.143 104 10366
##
## ==================
## case :
## obs. mean median s.d. min. max.
## 63 2.698 1 3.329 0 12
##
## ==================
In addition, we are also interested to look at the observed rates,
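A sketch of this step, adding a rate column (cases per person-year, rounded to four decimal places) consistent with the output below:
smoke %>% mutate(rate = round(case / person_yrs, 4))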
## smoke_yrs cigar_day person_yrs case rate
## 1 15-19 0.0 10366 1 0.0001
## 2 20-24 0.0 8162 0 0.0000
## 3 25-29 0.0 5969 0 0.0000
## 4 30-34 0.0 4496 0 0.0000
## 5 35-39 0.0 3512 0 0.0000
## 6 40-44 0.0 2201 0 0.0000
## 7 45-49 0.0 1421 0 0.0000
## 8 50-54 0.0 1121 0 0.0000
## 9 55-59 0.0 826 2 0.0024
## 10 15-19 5.2 3121 0 0.0000
## 11 20-24 5.2 2937 0 0.0000
## 12 25-29 5.2 2288 0 0.0000
## 13 30-34 5.2 2015 0 0.0000
## 14 35-39 5.2 1648 1 0.0006
## 15 40-44 5.2 1310 2 0.0015
## 16 45-49 5.2 927 0 0.0000
## 17 50-54 5.2 710 3 0.0042
## 18 55-59 5.2 606 0 0.0000
## 19 15-19 11.2 3577 0 0.0000
## 20 20-24 11.2 3286 1 0.0003
## 21 25-29 11.2 2546 1 0.0004
## 22 30-34 11.2 2219 2 0.0009
## 23 35-39 11.2 1826 0 0.0000
## 24 40-44 11.2 1386 1 0.0007
## 25 45-49 11.2 988 2 0.0020
## 26 50-54 11.2 684 4 0.0058
## 27 55-59 11.2 449 3 0.0067
## 28 15-19 15.9 4317 0 0.0000
## 29 20-24 15.9 4214 0 0.0000
## 30 25-29 15.9 3185 0 0.0000
## 31 30-34 15.9 2560 4 0.0016
## 32 35-39 15.9 1893 0 0.0000
## 33 40-44 15.9 1334 2 0.0015
## 34 45-49 15.9 849 2 0.0024
## 35 50-54 15.9 470 2 0.0043
## 36 55-59 15.9 280 5 0.0179
## 37 15-19 20.4 5683 0 0.0000
## 38 20-24 20.4 6385 1 0.0002
## 39 25-29 20.4 5483 1 0.0002
## 40 30-34 20.4 4687 6 0.0013
## 41 35-39 20.4 3646 5 0.0014
## 42 40-44 20.4 2411 12 0.0050
## 43 45-49 20.4 1567 9 0.0057
## 44 50-54 20.4 857 7 0.0082
## 45 55-59 20.4 416 7 0.0168
## 46 15-19 27.4 3042 0 0.0000
## 47 20-24 27.4 4050 1 0.0002
## 48 25-29 27.4 4290 4 0.0009
## 49 30-34 27.4 4268 9 0.0021
## 50 35-39 27.4 3529 9 0.0026
## 51 40-44 27.4 2424 11 0.0045
## 52 45-49 27.4 1409 10 0.0071
## 53 50-54 27.4 663 5 0.0075
## 54 55-59 27.4 284 3 0.0106
## 55 15-19 40.8 670 0 0.0000
## 56 20-24 40.8 1166 0 0.0000
## 57 25-29 40.8 1482 0 0.0000
## 58 30-34 40.8 1580 4 0.0025
## 59 35-39 40.8 1336 6 0.0045
## 60 40-44 40.8 924 10 0.0108
## 61 45-49 40.8 556 7 0.0126
## 62 50-54 40.8 255 4 0.0157
## 63 55-59 40.8 104 1 0.0096
10.5.4 Univariable analysis
For the univariable analysis, we fit univariable Poisson regression models for cigarettes per day (cigar_day) and years of smoking (smoke_yrs). The offset or denominator is included as offset = log(person_yrs) in the glm option.
# cigar_day
pois_case1 = glm(case ~ cigar_day, data = smoke, family = "poisson",
offset = log(person_yrs))
summary(pois_case1)
##
## Call:
## glm(formula = case ~ cigar_day, family = "poisson", data = smoke,
## offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.159268 0.175063 -46.61 <2e-16 ***
## cigar_day 0.070359 0.006468 10.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.10 on 62 degrees of freedom
## Residual deviance: 324.71 on 61 degrees of freedom
## AIC: 448.55
##
## Number of Fisher Scoring iterations: 6
# smoke_yrs
pois_case2 = glm(case ~ smoke_yrs, data = smoke, family = "poisson",
offset = log(person_yrs))
summary(pois_case2)
##
## Call:
## glm(formula = case ~ smoke_yrs, family = "poisson", data = smoke,
## offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.335 1.000 -10.335 < 2e-16 ***
## smoke_yrs20-24 1.117 1.155 0.968 0.333149
## smoke_yrs25-29 1.990 1.080 1.842 0.065423 .
## smoke_yrs30-34 3.563 1.020 3.493 0.000477 ***
## smoke_yrs35-39 3.615 1.024 3.532 0.000412 ***
## smoke_yrs40-44 4.580 1.013 4.521 6.15e-06 ***
## smoke_yrs45-49 4.785 1.016 4.707 2.52e-06 ***
## smoke_yrs50-54 5.085 1.020 4.987 6.14e-07 ***
## smoke_yrs55-59 5.384 1.024 5.261 1.44e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.10 on 62 degrees of freedom
## Residual deviance: 170.17 on 54 degrees of freedom
## AIC: 308.01
##
## Number of Fisher Scoring iterations: 5
From the outputs, both variables are important, with P-values < .25 for most of the coefficients, including the dummy variables.
10.5.5 Multivariable analysis
For the multivariable analysis, we included cigar_day and smoke_yrs as predictors of case.
pois_case = glm(case ~ cigar_day + smoke_yrs, data = smoke,
family = "poisson", offset = log(person_yrs))
summary(pois_case)
##
## Call:
## glm(formula = case ~ cigar_day + smoke_yrs, family = "poisson",
## data = smoke, offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.317162 1.007492 -11.233 < 2e-16 ***
## cigar_day 0.064361 0.006401 10.054 < 2e-16 ***
## smoke_yrs20-24 0.962417 1.154694 0.833 0.40457
## smoke_yrs25-29 1.710894 1.080420 1.584 0.11330
## smoke_yrs30-34 3.207676 1.020378 3.144 0.00167 **
## smoke_yrs35-39 3.242776 1.024187 3.166 0.00154 **
## smoke_yrs40-44 4.208361 1.013726 4.151 3.30e-05 ***
## smoke_yrs45-49 4.448972 1.017054 4.374 1.22e-05 ***
## smoke_yrs50-54 4.893674 1.019945 4.798 1.60e-06 ***
## smoke_yrs55-59 5.372600 1.023404 5.250 1.52e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.099 on 62 degrees of freedom
## Residual deviance: 68.015 on 53 degrees of freedom
## AIC: 207.86
##
## Number of Fisher Scoring iterations: 5
From the output, both variables are significant predictors of the rate of lung cancer cases, although we noted that the P-values are not significant for the smoke_yrs20-24 and smoke_yrs25-29 dummy variables. This model serves as our preliminary model.
10.5.6 Interaction
Now, we include a two-way interaction term between cigar_day and smoke_yrs. From the output, although we noted that the interaction terms are not significant, the standard errors for cigar_day and the interaction terms are extremely large. This might point to a numerical issue with the model (Hosmer, Lemeshow, and Sturdivant 2013). So, we may drop the interaction term from our model.
pois_casex = glm(case ~ cigar_day * smoke_yrs, data = smoke,
family = "poisson", offset = log(person_yrs))
## Warning: glm.fit: fitted rates numerically 0 occurred
##
## Call:
## glm(formula = case ~ cigar_day * smoke_yrs, family = "poisson",
## data = smoke, offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.2463 1.0000 -9.246 < 2e-16 ***
## cigar_day -2.8137 319.6351 -0.009 0.992976
## smoke_yrs20-24 -0.7514 1.5114 -0.497 0.619084
## smoke_yrs25-29 -0.2539 1.3518 -0.188 0.851030
## smoke_yrs30-34 1.2493 1.1032 1.132 0.257489
## smoke_yrs35-39 0.5856 1.1660 0.502 0.615498
## smoke_yrs40-44 1.9706 1.0790 1.826 0.067800 .
## smoke_yrs45-49 2.1070 1.0982 1.919 0.055044 .
## smoke_yrs50-54 3.0710 1.0761 2.854 0.004318 **
## smoke_yrs55-59 3.5704 1.0706 3.335 0.000853 ***
## cigar_day:smoke_yrs20-24 2.8609 319.6351 0.009 0.992859
## cigar_day:smoke_yrs25-29 2.8736 319.6351 0.009 0.992827
## cigar_day:smoke_yrs30-34 2.8736 319.6351 0.009 0.992827
## cigar_day:smoke_yrs35-39 2.8997 319.6351 0.009 0.992762
## cigar_day:smoke_yrs40-44 2.8845 319.6351 0.009 0.992800
## cigar_day:smoke_yrs45-49 2.8886 319.6351 0.009 0.992790
## cigar_day:smoke_yrs50-54 2.8669 319.6351 0.009 0.992844
## cigar_day:smoke_yrs55-59 2.8641 319.6351 0.009 0.992851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.099 on 62 degrees of freedom
## Residual deviance: 60.597 on 45 degrees of freedom
## AIC: 216.44
##
## Number of Fisher Scoring iterations: 18
10.5.7 Model fit assessment
Again, we assess the model fit by the chi-square goodness-of-fit test, model-to-model AIC comparison, the scaled Pearson chi-square statistic and standardized residuals.
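A sketch of the goodness-of-fit call on the preliminary model:
epiDisplay::poisgof(pois_case)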
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 68.01514
##
## $df
## [1] 53
##
## $p.value
## [1] 0.0802866
The P-value of the chi-square goodness-of-fit test is more than 0.05, which indicates that the model has a good fit.
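A sketch of the AIC comparison across the models fitted so far:
AIC(pois_case1, pois_case2, pois_case)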
## df AIC
## pois_case1 2 448.5549
## pois_case2 9 308.0129
## pois_case 10 207.8574
The comparison by AIC clearly shows that the multivariable model pois_case is the best model, as it has the lowest AIC value.
# Scaled Pearson chi-square statistic using quasipoisson
qpois_case_summ = summary(glm(case ~ cigar_day + smoke_yrs, data = smoke,
family = "quasipoisson",
offset = log(person_yrs)))
qpois_case_summ$dispersion
## [1] 1.072516
The value of the dispersion, i.e. the scaled Pearson chi-square statistic, is close to 1. This again indicates that the model has a good fit.
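As before, we may also obtain the standardized residuals; a sketch, with std_res as used in the code below:
std_res = rstandard(pois_case)
std_res[abs(std_res) > 1.96]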
## 6 8 18
## -1.995530 -2.025055 -2.253832
index = names(std_res[abs(std_res) > 1.96])
# points of discrepancies
cbind(smoke[index,], case_pred = pois_case$fitted[index]) %>%
mutate(rate = round(case/person_yrs, 4),
rate_pred = round(case_pred/person_yrs, 4))
## smoke_yrs cigar_day person_yrs case case_pred rate rate_pred
## 6 40-44 0.0 2201 0 1.800143 0 0.0008
## 8 50-54 0.0 1121 0 1.819366 0 0.0016
## 18 55-59 5.2 606 0 2.218868 0 0.0037
Lastly, we note that only a few observations (numbers 6, 8 and 18) have discrepancies between the observed and predicted cases. This shows how well the fitted Poisson regression model for rate explains the data at hand.
10.5.8 Presentation and interpretation
Now, we decide on the final model and use tbl_regression() to come up with a table for the results. Again, for interpretation, we exponentiate the coefficients to obtain the incidence rate ratio, IRR.
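A sketch of these steps; the object name pois_case_final is only illustrative:
pois_case_final = pois_case
tbl_regression(pois_case_final, exponentiate = TRUE)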
| Characteristic | IRR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| cigar_day | 1.07 | 1.05, 1.08 | <0.001 |
| smoke_yrs | | | |
| 15-19 | — | — | |
| 20-24 | 2.62 | 0.34, 52.9 | 0.4 |
| 25-29 | 5.53 | 0.94, 105 | 0.11 |
| 30-34 | 24.7 | 5.23, 442 | 0.002 |
| 35-39 | 25.6 | 5.35, 459 | 0.002 |
| 40-44 | 67.2 | 14.6, 1,195 | <0.001 |
| 45-49 | 85.5 | 18.3, 1,525 | <0.001 |
| 50-54 | 133 | 28.3, 2,385 | <0.001 |
| 55-59 | 215 | 45.1, 3,862 | <0.001 |

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
From this table, we interpret the IRR values as follows:
- Taking an additional cigarette per day increases the risk of having lung cancer by 1.07 (95% CI: 1.05, 1.08), while controlling for the other variables.
- Those who had been smoking for between 30 to 34 years are at higher risk of having lung cancer with an IRR of 24.7 (95% CI: 5.23, 442), while controlling for the other variables.
We leave the rest of the IRRs for you to interpret. But take note that the IRRs for the years of smoking (smoke_yrs) categories from 30-34 to 55-59 are quite large with wide 95% CIs, although this does not seem to be a problem since the standard errors for the estimated coefficients are reasonable (look again at summary(pois_case)). The 95% CIs for 20-24 and 25-29 include 1 (which means no effect on risk), spanning from lower risk (IRR < 1) to higher risk (IRR > 1). This is expected because the P-values for these two categories are not significant.
Then, we view and save the output in spreadsheet format for later use. We use tidy(),
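A sketch of the call producing the output below:
tidy(pois_case, exponentiate = TRUE, conf.int = TRUE)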
## # A tibble: 10 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000122 1.01 -11.2 2.81e-29 6.89e-7 5.49e-5
## 2 cigar_day 1.07 0.00640 10.1 8.79e-24 1.05e+0 1.08e+0
## 3 smoke_yrs20-24 2.62 1.15 0.833 4.05e- 1 3.35e-1 5.29e+1
## 4 smoke_yrs25-29 5.53 1.08 1.58 1.13e- 1 9.44e-1 1.05e+2
## 5 smoke_yrs30-34 24.7 1.02 3.14 1.67e- 3 5.23e+0 4.42e+2
## 6 smoke_yrs35-39 25.6 1.02 3.17 1.54e- 3 5.35e+0 4.59e+2
## 7 smoke_yrs40-44 67.2 1.01 4.15 3.30e- 5 1.46e+1 1.19e+3
## 8 smoke_yrs45-49 85.5 1.02 4.37 1.22e- 5 1.83e+1 1.52e+3
## 9 smoke_yrs50-54 133. 1.02 4.80 1.60e- 6 2.83e+1 2.38e+3
## 10 smoke_yrs55-59 215. 1.02 5.25 1.52e- 7 4.51e+1 3.86e+3
and export it to a .csv file,
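for example as sketched here, where the object and file names are only illustrative:
tib_pois_case = tidy(pois_case, exponentiate = TRUE, conf.int = TRUE)
write.csv(tib_pois_case, "pois_case.csv")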
Now, we present the model equation, which unfortunately is quite a lengthy one this time. We display the coefficients,
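A sketch of the call that produces the rounded coefficient table below:
round(summary(pois_case)$coefficients, 2)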
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.32 1.01 -11.23 0.00
## cigar_day 0.06 0.01 10.05 0.00
## smoke_yrs20-24 0.96 1.15 0.83 0.40
## smoke_yrs25-29 1.71 1.08 1.58 0.11
## smoke_yrs30-34 3.21 1.02 3.14 0.00
## smoke_yrs35-39 3.24 1.02 3.17 0.00
## smoke_yrs40-44 4.21 1.01 4.15 0.00
## smoke_yrs45-49 4.45 1.02 4.37 0.00
## smoke_yrs50-54 4.89 1.02 4.80 0.00
## smoke_yrs55-59 5.37 1.02 5.25 0.00
and put the values in the equation. Remember to include the offset in the equation.
\[\begin{aligned} ln(case) = &\ ln(person\_yrs) -11.32 + 0.06\times cigar\_day \\ & + 0.96\times smoke\_yrs(20-24) + 1.71\times smoke\_yrs(25-29) \\ & + 3.21\times smoke\_yrs(30-34) + 3.24\times smoke\_yrs(35-39) \\ & + 4.21\times smoke\_yrs(40-44) + 4.45\times smoke\_yrs(45-49) \\ & + 4.89\times smoke\_yrs(50-54) + 5.37\times smoke\_yrs(55-59) \end{aligned}\]
10.6 Quasi-Poisson Regression for Overdispersed Data
We utilized the family = "quasipoisson" option in the glm specification earlier just to easily obtain the scaled Pearson chi-square statistic without knowing what it is. So, what is quasi-Poisson regression? For a typical Poisson regression analysis, we rely on the maximum likelihood estimation method. It assumes that the mean (of the count) and its variance are equal, or that the variance divided by the mean equals 1. For that reason, we expect the scaled Pearson chi-square statistic to be close to 1, indicating a good fit of the Poisson regression model.
So what if this assumption of "mean equals variance" is violated? Whenever the variance is larger than the mean, we call this issue overdispersion. Overdispersion affects a Poisson regression model by causing the standard errors of the coefficients to be underestimated. As a result, we may obtain narrower confidence intervals and smaller P-values (i.e. more likely to have false positive results) than we should. To handle the overdispersion issue, one may use negative binomial regression, which we do not cover in this book. A more flexible option is quasi-Poisson regression, which relies on the quasi-likelihood estimation method (Fleiss, Levin, and Paik 2003).
Demonstrating quasi-Poisson regression is not difficult because we already did it when we obtained the scaled Pearson chi-square statistic in the previous sections. As an example, we repeat the same here using the model for count. We fit the standard Poisson regression model,
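a sketch of which is given below; the object name pois_attack_all is only illustrative:
pois_attack_all = glm(attack ~ gender + res_inf + ghq12, data = asthma, family = "poisson")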
Then we fit the same model using quasi-Poisson regression,
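changing only the family argument, as sketched here with an illustrative object name:
qpois_attack_all = glm(attack ~ gender + res_inf + ghq12, data = asthma, family = "quasipoisson")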
Now, pay attention to the standard errors and confidence intervals of each model. Can you spot the differences between the two? (Hints: look at the std.error, p.value, conf.low and conf.high columns.)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.315 0.183 -1.72 8.57e- 2 -0.685 0.0347
## 2 gendermale -0.0419 0.122 -0.342 7.32e- 1 -0.285 0.196
## 3 res_infyes 0.426 0.153 2.79 5.28e- 3 0.133 0.733
## 4 ghq12 0.0495 0.00788 6.28 3.29e-10 0.0342 0.0651
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.315 0.189 -1.67 0.0979 -0.697 0.0449
## 2 gendermale -0.0419 0.126 -0.332 0.740 -0.292 0.203
## 3 res_infyes 0.426 0.157 2.71 0.00779 0.124 0.742
## 4 ghq12 0.0495 0.00812 6.10 0.0000000143 0.0337 0.0656
Note that there are no changes to the coefficients between the standard Poisson regression and the quasi-Poisson regression. We also interpret the quasi-Poisson regression model output in the same way as the standard Poisson regression model output.
10.7 Summary
In this chapter, we went through the basics of Poisson regression for count and rate data. We performed the analysis for each and learned how to assess the model fit for the regression models. We learned how to nicely present and interpret the results. In addition, we also learned how to utilize the model for prediction. To understand more about the concepts, analysis workflow and interpretation of count data analysis, including Poisson regression, we recommend the books Epidemiology: Study Design and Data Analysis (Woodward 2013) and Regression Models for Categorical Dependent Variables Using Stata (Long, Freese, and LP. 2006).