Chapter 10 Poisson Regression
10.1 Objectives
After completing this chapter, the readers are expected to
- understand the basic concepts behind Poisson regression for count and rate data
- perform Poisson regression for count and rate
- perform model fit assessment
- present and interpret the results of Poisson regression analyses
10.2 Introduction
Poisson regression is a regression analysis for count and rate data. As mentioned in Chapter 7, it is a type of generalized linear model (GLM) used whenever the outcome is a count; it also accommodates rate data, as we will see shortly. Although count and rate data are very common in medicine and the health sciences, in our experience Poisson regression is underutilized in medical research. Most often, researchers end up using linear regression because they are more familiar with it and have had less exposure to the advantages of using Poisson regression to handle count and rate data.
A count is discrete numerical data. Although it is convenient to handle a count outcome with linear regression by treating the count (e.g. the number of hospital admissions) as if it were continuous numerical data (e.g. systolic blood pressure in mmHg), doing so may produce illogical predicted values. For example, using linear regression to predict the number of asthmatic attacks in the past year, we may end up with a negative number of attacks, which makes no clinical sense! So, it is recommended that medical researchers become familiar with Poisson regression and use it whenever the outcome variable is a count.
Another reason for using Poisson regression arises when the number of cases (e.g. deaths, accidents) is small relative to the number of non-events (e.g. alive, no accident). In that situation it makes more sense to collect information only on the cases in a population of interest, instead of also collecting information on the non-cases as in typical cohort and case-control studies. For example, in publicly available COVID-19 data, only the number of deaths was reported, along with some basic sociodemographic and clinical information for the cases. Whenever information on the non-cases is available, it is quite easy to use logistic regression for the analysis instead.
Basically, Poisson regression models the linear relationship between:
- outcome: count variable (e.g. the number of hospital admissions, parity, cancerous lesions, asthmatic attacks). This is transformed into the natural log scale.
- predictors/independent variables: numerical variables (e.g. age, blood pressure, income) and categorical variables (e.g. gender, race, education level).
We might be interested in knowing the relationship between the number of asthmatic attacks in the past one year with sociodemographic factors. This relationship can be explored by a Poisson regression analysis.
Basically, for Poisson regression, the relationship between the outcome and predictors is as follows,
\[\begin{aligned} natural\ log\ of\ count\ outcome = &\ numerical\ predictors \\ & + categorical\ predictors \end{aligned}\]
At times, the count is proportional to a denominator. For example, given the same number of deaths, the death rate in a small population will be higher than the rate in a large population. If we were to compare the number of deaths between the populations, it would not be a fair comparison. Thus, we may consider adding denominators to the Poisson regression model in the form of offsets. The denominator could also be the unit time of exposure, for example person-years of cigarette smoking. This will be explained later in the section on Poisson regression for rate.
In the previous chapter, we learned that logistic regression allows us to obtain the odds ratio, which is approximately the relative risk given a predictor. For Poisson regression, by taking the exponent of the coefficient, we obtain the rate ratio RR (also known as incidence rate ratio IRR),
\[RR=exp(b_{p})\] for the coefficient \(b_p\) of the p-th predictor. This is interpreted in a similar way to the odds ratio for logistic regression, i.e. as approximately the relative risk given a predictor.
10.3 Prepare R Environment for Analysis
10.3.1 Libraries
For this chapter, we will be using the following packages:
- tidyverse: a general and powerful package for data transformation
- psych: for descriptive statistics
- gtsummary: for coming up with nice tables for results
- broom: for tidying up the results
- epiDisplay: an epidemiological data display package
These are loaded as follows using the function library(),
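A minimal sketch of this step, assuming the four packages above are already installed:
library(tidyverse)
library(psych)
library(gtsummary)
library(broom)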
For epiDisplay, we will use the package directly using epiDisplay::function_name() instead. We did not load the package as we usually do with library(epiDisplay) because it has some conflicts with the packages we loaded above.
10.4 Poisson regression for Count
10.4.1 About Poisson regression for count
Poisson regression models the linear relationship between:
- outcome: count variable (e.g. the number of hospital admissions, parity, cancerous lesions, asthmatic attacks). This is transformed into the natural log scale.
- predictor(s): numerical variables (e.g. age, blood pressure, income) and categorical variables (e.g. gender, race, education level).
We might be interested in knowing the relationship between the number of asthmatic attacks in the past one year with sociodemographic factors. This relationship can be explored by a Poisson regression analysis.
Multiple Poisson regression for count is given as,
\[\begin{aligned} ln(count\ outcome) = &\ intercept \\ & + coefficients \times numerical\ predictors \\ & + coefficients \times categorical\ predictors \end{aligned}\]
or in a shorter form,
\[ln(\hat y) = b_0 + b_1x_1 + b_2x_2 + ... + b_px_p\] where we have p predictors.
10.4.2 Dataset
The data on the number of asthmatic attacks per year among a sample of 120 patients and the associated factors are given in asthma.csv.
The dataset contains four variables:
- gender: Gender of the subjects (categorical) {male, female}.
- res_inf: Recurrent respiratory infection (categorical) {no, yes}.
- ghq12: General Health Questionnaire 12 (GHQ-12) score of psychological well-being (numerical) {0 to 36}.
- attack: Number of asthmatic attacks per year (count).
The dataset is loaded as follows,
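A sketch of this step, assuming asthma.csv is in the working directory; the object name asthma matches the code used later in this chapter:
asthma = read.csv("asthma.csv")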
We then look at the basic structure of the dataset,
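For example, with str():
str(asthma)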
## 'data.frame': 120 obs. of 4 variables:
## $ gender : chr "female" "male" "male" "female" ...
## $ res_inf: chr "yes" "no" "yes" "yes" ...
## $ ghq12 : int 21 17 30 22 27 33 24 23 25 28 ...
## $ attack : int 6 4 8 5 2 3 2 1 2 2 ...
10.4.3 Data exploration
For descriptive statistics, we introduce the epiDisplay package. It is a nice package that allows us to easily obtain statistics for both numerical and categorical variables at the same time. We use the codebook() function from the package.
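A sketch of the call, consistent with the output below:
epiDisplay::codebook(asthma)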
##
##
##
## gender :
## A character vector
##
## ==================
## res_inf :
## A character vector
##
## ==================
## ghq12 :
## obs. mean median s.d. min. max.
## 120 16.342 19 9.81 0 33
##
## ==================
## attack :
## obs. mean median s.d. min. max.
## 120 2.458 2 2.012 0 9
##
## ==================
10.4.4 Univariable analysis
For the univariable analysis, we fit univariable Poisson regression models for gender (gender), recurrent respiratory infection (res_inf) and GHQ-12 (ghq12).
# gender
pois_attack1 = glm(attack ~ gender, data = asthma, family = "poisson")
summary(pois_attack1)
##
## Call:
## glm(formula = attack ~ gender, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.02105 0.07332 13.925 <2e-16 ***
## gendermale -0.30000 0.12063 -2.487 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 223.23 on 118 degrees of freedom
## AIC: 500.3
##
## Number of Fisher Scoring iterations: 5
# res_inf
pois_attack2 = glm(attack ~ res_inf, data = asthma, family = "poisson")
summary(pois_attack2)
##
## Call:
## glm(formula = attack ~ res_inf, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2877 0.1213 2.372 0.0177 *
## res_infyes 0.9032 0.1382 6.533 6.44e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 180.49 on 118 degrees of freedom
## AIC: 457.56
##
## Number of Fisher Scoring iterations: 5
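For ghq12, the corresponding chunk (a sketch consistent with the Call shown in the output below; the object name pois_attack3 matches the AIC comparison later) would be:
# ghq12
pois_attack3 = glm(attack ~ ghq12, data = asthma, family = "poisson")
summary(pois_attack3)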
##
## Call:
## glm(formula = attack ~ ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.230923 0.159128 -1.451 0.147
## ghq12 0.059500 0.006919 8.599 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 145.13 on 118 degrees of freedom
## AIC: 422.2
##
## Number of Fisher Scoring iterations: 5
From the outputs, all variables are important with P-values < .25. These variables are candidates for inclusion in the multivariable analysis. As a reminder, however, in the context of confirmatory research the choice of variables to include must also be guided by expert judgement.
10.4.5 Multivariable analysis
For the multivariable analysis, we included all variables as predictors of attack. Here we use the dot . as a shortcut for all variables when specifying the right-hand side of the formula of the glm.
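A sketch of the fit, consistent with the Call in the output below; the object name pois_attack_all is only illustrative:
pois_attack_all = glm(attack ~ ., data = asthma, family = "poisson")
summary(pois_attack_all)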
##
## Call:
## glm(formula = attack ~ ., family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.315387 0.183500 -1.719 0.08566 .
## gendermale -0.041905 0.122469 -0.342 0.73222
## res_infyes 0.426431 0.152859 2.790 0.00528 **
## ghq12 0.049508 0.007878 6.285 3.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 136.68 on 116 degrees of freedom
## AIC: 417.75
##
## Number of Fisher Scoring iterations: 5
From the output, we noted that gender is not significant with P > 0.05, although it was significant in the univariable analysis. Now, we fit a model excluding gender,
# minus gender
pois_attack_all1 = glm(attack ~ res_inf + ghq12, data = asthma,
family = "poisson")
summary(pois_attack_all1)
##
## Call:
## glm(formula = attack ~ res_inf + ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.34051 0.16823 -2.024 0.04296 *
## res_infyes 0.42816 0.15282 2.802 0.00508 **
## ghq12 0.04989 0.00779 6.404 1.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 136.80 on 117 degrees of freedom
## AIC: 415.86
##
## Number of Fisher Scoring iterations: 5
From the output, both variables are significant predictors of asthmatic attack (or, more accurately, of the natural log of the count of asthmatic attacks). This serves as our preliminary model.
10.4.6 Interaction
Now, we include a two-way interaction term between res_inf and ghq12.
pois_attack_allx = glm(attack ~ res_inf * ghq12, data = asthma,
family = "poisson")
summary(pois_attack_allx)
##
## Call:
## glm(formula = attack ~ res_inf * ghq12, family = "poisson", data = asthma)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.63436 0.23408 -2.710 0.00673 **
## res_infyes 1.01927 0.32822 3.105 0.00190 **
## ghq12 0.06834 0.01186 5.763 8.29e-09 ***
## res_infyes:ghq12 -0.03135 0.01531 -2.048 0.04056 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 229.56 on 119 degrees of freedom
## Residual deviance: 132.67 on 116 degrees of freedom
## AIC: 413.74
##
## Number of Fisher Scoring iterations: 5
It turns out that the interaction term res_inf * ghq12 is significant. We may include this interaction term in the final model. However, this might complicate our interpretation of the result as we can no longer interpret individual coefficients. Given that the P-value of the interaction term is close to the commonly used significance level of 0.05, we may choose to ignore this interaction. But keep in mind that the decision is yours, the analyst. Having said that, if the purpose of modelling is mainly for prediction, the issue is less severe because we are more concerned with the predicted values than with the clinical interpretation of the result. However, if you insist on including the interaction, it can be done by writing down the equation for the model, substituting the value of res_inf with yes = 1 or no = 0, and obtaining the coefficient for ghq12. We will see how to do this under Presentation and interpretation below.
10.4.7 Model fit assessment
For Poisson regression, we assess the model fit by chi-square goodness-of-fit test, model-to-model AIC comparison and scaled Pearson chi-square statistic. We also assess the regression diagnostics using standardized residuals.
Chi-square goodness-of-fit
The chi-square goodness-of-fit test can be performed using the poisgof() function in the epiDisplay package. Note that, instead of using the Pearson chi-square statistic, it utilizes the residual deviance with its respective degrees of freedom (df) (e.g. from the output of summary(pois_attack_all1) above). A P-value > 0.05 indicates good model fit.
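A sketch of the call on our preliminary model:
epiDisplay::poisgof(pois_attack_all1)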
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 136.7964
##
## $df
## [1] 117
##
## $p.value
## [1] 0.101934
Model-to-model AIC comparison
We may also compare the models that we have fitted so far by the Akaike information criterion (AIC). Recall that R uses AIC for stepwise automatic variable selection, as explained in the Linear Regression chapter.
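A sketch of the comparison, using the model objects fitted so far:
AIC(pois_attack1, pois_attack2, pois_attack3, pois_attack_all1, pois_attack_allx)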
## df AIC
## pois_attack1 2 500.3009
## pois_attack2 2 457.5555
## pois_attack3 2 422.1997
## pois_attack_all1 3 415.8649
## pois_attack_allx 4 413.7424
The best model is the one with the lowest AIC, which is the model with the interaction term. However, since the AIC of the model with the interaction term differs only slightly from that of the model without it, we may instead choose the simpler model without the interaction term.
Scaled Pearson chi-square statistic
The Pearson chi-square statistic divided by its df gives the scaled Pearson chi-square statistic (Fleiss, Levin, and Paik 2003). The closer the value of this statistic is to 1, the better the model fit. First, the Pearson chi-square statistic is calculated as,
\[\chi^2_P = \sum_{i=1}^n \frac{(y_i - \hat y_i)^2}{\hat y_i}\] easily obtained in R as below.
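A sketch of the calculation, using the observed counts and the fitted values from pois_attack_all1:
x2 = sum((asthma$attack - pois_attack_all1$fitted)^2 / pois_attack_all1$fitted)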
Then we obtain the scaled Pearson chi-square statistic \(\chi^2_P / df\), where \(df = n - p\). Here, \(n\) is the number of observations, nrow(asthma), and \(p\) is the number of coefficients/parameters estimated for the model, length(pois_attack_all1$coefficients).
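Continuing the sketch, with the result saved as sx2 (the name referred to below):
sx2 = x2 / (nrow(asthma) - length(pois_attack_all1$coefficients))
sx2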
## [1] 1.052376
The value of sx2 is 1.052, which is close to 1. This indicates good model fit.
It is actually easier to obtain the scaled Pearson chi-square statistic by changing family = "poisson" to family = "quasipoisson" in the glm specification, then viewing the dispersion value from the summary of the model. This works because the scaled Pearson chi-square statistic is an estimator of the overdispersion parameter in a quasi-Poisson regression model (Fleiss, Levin, and Paik 2003). We will discuss quasi-Poisson regression towards the end of this chapter.
qpois_attack_all1_summ = summary(glm(attack ~ res_inf + ghq12,
data = asthma, family = "quasipoisson"))
qpois_attack_all1_summ$dispersion
## [1] 1.052383
Regression diagnostics
Here, we use standardized residuals obtained with the rstandard() function. Because they are in the form of standardized z scores, we may use specific cutoffs to find outliers, for example 1.96 (for \(\alpha\) = 0.05) or 3.89 (for \(\alpha\) = 0.0001).
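A sketch of this step; std_res is the name used in the code further below:
std_res = rstandard(pois_attack_all1)
std_res[abs(std_res) > 1.96]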
## 28 38 95 103
## -2.082287 2.168737 1.977013 2.461419
We now locate where the discrepancies are,
index = names(std_res[abs(std_res) > 1.96])
cbind(asthma[index,], attack_pred = pois_attack_all1$fitted[index])
## gender res_inf ghq12 attack attack_pred
## 28 female yes 29 1 4.6384274
## 38 female yes 26 9 3.9936854
## 95 male no 1 3 0.7477996
## 103 female no 19 6 1.8355524
10.4.8 Presentation and interpretation
Model without interaction
After all these assumption check points, we decide on the final model and rename the model for easier reference.
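A sketch of the renaming step, using pois_attack_final (the name used in the prediction code later):
pois_attack_final = pois_attack_all1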
We use tbl_regression() to come up with a table for the results. Here, for interpretation, we exponentiate the coefficients to obtain the incidence rate ratio, IRR.
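A sketch of the call; exponentiate = TRUE returns IRRs instead of log coefficients:
tbl_regression(pois_attack_final, exponentiate = TRUE)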
| Characteristic | IRR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| res_inf | | | |
| no | — | — | |
| yes | 1.53 | 1.14, 2.08 | 0.005 |
| ghq12 | 1.05 | 1.04, 1.07 | <0.001 |

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
Based on this table, we may interpret the results as follows:
- Those with recurrent respiratory infection are at higher risk of having an asthmatic attack with an IRR of 1.53 (95% CI: 1.14, 2.08), while controlling for the effect of GHQ-12 score.
- An increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.05 (95% CI: 1.04, 1.07), while controlling for the effect of recurrent respiratory infection.
We can also view and save the output in a format suitable for exporting to a spreadsheet for later use. We use the tidy() function for the job,
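A sketch of the call that produces the exponentiated, tidied output below:
tidy(pois_attack_final, exponentiate = TRUE, conf.int = TRUE)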
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.711 0.168 -2.02 4.30e- 2 0.505 0.978
## 2 res_infyes 1.53 0.153 2.80 5.08e- 3 1.14 2.08
## 3 ghq12 1.05 0.00779 6.40 1.51e-10 1.04 1.07
and export it to a .csv file,
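for example as sketched here, where the object and file names are only illustrative:
tib_pois_attack = tidy(pois_attack_final, exponentiate = TRUE, conf.int = TRUE)
write.csv(tib_pois_attack, "pois_attack_final.csv")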
Then, we display the coefficients (i.e. without the exponent) and transfer the values into an equation,
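A sketch of the call that produces the rounded coefficient table below:
round(summary(pois_attack_final)$coefficients, 2)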
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.34 0.17 -2.02 0.04
## res_infyes 0.43 0.15 2.80 0.01
## ghq12 0.05 0.01 6.40 0.00
\[\begin{aligned} ln(attack) = & -0.34 + 0.43\times res\_inf + 0.05\times ghq12 \end{aligned}\]
From the table and equation above, the effect of a one-mark increase in GHQ-12 score might not be clinically interesting. Let's say, as clinicians, we want to know the effect of an increase in GHQ-12 score by six marks instead, which is 1/6 of the maximum score of 36. From the coefficient for GHQ-12 of 0.05, the risk is calculated as
\[IRR_{GHQ12\ by\ 6} = exp(0.05\times 6) = 1.35\]
Or we may fit the model again with some adjustment to the data and the glm specification. First, we divide the ghq12 values by 6 and save them into a new variable ghq12_by6, followed by fitting the model again using the edited data set and the new variable,
# Divide ghq12 by 6
asthma_new = asthma %>% mutate(ghq12_by6 = ghq12 / 6)
# Fit the model
pois_attack_final_by6 = glm(attack ~ res_inf + ghq12_by6,
data = asthma_new, family = "poisson")
Now we view the results for the re-fitted model,
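A sketch of the calls, first on the log scale and then exponentiated, matching the two outputs below:
tidy(pois_attack_final_by6, conf.int = TRUE)
tidy(pois_attack_final_by6, exponentiate = TRUE, conf.int = TRUE)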
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.341 0.168 -2.02 4.30e- 2 -0.682 -0.0222
## 2 res_infyes 0.428 0.153 2.80 5.08e- 3 0.135 0.734
## 3 ghq12_by6 0.299 0.0467 6.40 1.51e-10 0.209 0.392
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.711 0.168 -2.02 4.30e- 2 0.505 0.978
## 2 res_infyes 1.53 0.153 2.80 5.08e- 3 1.14 2.08
## 3 ghq12_by6 1.35 0.0467 6.40 1.51e-10 1.23 1.48
As compared to the first method, which requires multiplying the coefficient manually, the second method is preferable in R as we also get the 95% CI for ghq12_by6.
Model with interaction
Now we will go through the interpretation of the model with interaction. We display the coefficients for the model with interaction (pois_attack_allx) and enter the values into an equation,
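A sketch of the call that produces the rounded coefficients below:
round(summary(pois_attack_allx)$coefficients, 2)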
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.63 0.23 -2.71 0.01
## res_infyes 1.02 0.33 3.11 0.00
## ghq12 0.07 0.01 5.76 0.00
## res_infyes:ghq12 -0.03 0.02 -2.05 0.04
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \end{aligned}\]
As we need to interpret the coefficient for ghq12 by the status of res_inf, we write an equation for each res_inf status. When res_inf = 1 (yes),
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \\ = & -0.63 + 1.02\times 1 + 0.07\times ghq12 -0.03\times 1\times ghq12 \\ = &\ 0.39 + 0.04\times ghq12 \end{aligned}\]
and when res_inf = 0 (no),
\[\begin{aligned} ln(attack) = & -0.63 + 1.02\times res\_inf + 0.07\times ghq12 \\ & -0.03\times res\_inf\times ghq12 \\ = & -0.63 + 1.02\times 0 + 0.07\times ghq12 -0.03\times 0\times ghq12 \\ = & -0.63 + 0.07\times ghq12 \end{aligned}\]
Now, based on the equations, we may interpret the results as follows:
- For those with recurrent respiratory infection, an increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.04 (IRR = exp[0.04]).
- For those without recurrent respiratory infection, an increase in GHQ-12 score by one mark increases the risk of having an asthmatic attack by 1.07 (IRR = exp[0.07]).
Based on these IRRs, the effect of an increase in GHQ-12 score is slightly higher for those without recurrent respiratory infection. However, both are close to the IRR for a one-mark increase in GHQ-12 score from the model without interaction, IRR = exp(0.05) = 1.05. Since the differences in the IRR values for GHQ-12 between the models are minimal, the simpler Poisson regression model without the interaction is preferable in this case. But now you get the idea of how to interpret a model with an interaction term.
10.4.9 Prediction
We can use the final model above for prediction. Relevant to our data set, we may want to know the expected number of asthmatic attacks per year for a patient with recurrent respiratory infection and GHQ-12 score of 8,
pred = predict(pois_attack_final, list(res_inf = "yes", ghq12 = 8),
type = "response")
round(pred, 1)
## 1
## 1.6
Now, let’s say we want to know the expected number of asthmatic attacks per year for those with and without recurrent respiratory infection for each 12-mark increase in GHQ-12 score.
new_data = data.frame(res_inf = rep(c("yes", "no"), each = 4),
ghq12 = rep(c(0, 12, 24, 36), 2))
new_data$attack_pred = round(predict(pois_attack_final, new_data,
type = "response"), 1)
new_data
## res_inf ghq12 attack_pred
## 1 yes 0 1.1
## 2 yes 12 2.0
## 3 yes 24 3.6
## 4 yes 36 6.6
## 5 no 0 0.7
## 6 no 12 1.3
## 7 no 24 2.4
## 8 no 36 4.3
10.5 Poisson regression for Rate
10.5.1 About Poisson regression for rate
At times, the count is proportional to a denominator. For example, given the same number of deaths, the death rate in a small population will be higher than the rate in a large population. If we were to compare the number of deaths between the populations, it would not be a fair comparison. Thus, we may consider adding denominators to the Poisson regression model in the form of offsets. The denominator could also be the unit time of exposure, for example person-years of cigarette smoking.
As mentioned before, counts can be proportional to specific denominators, giving rise to rates. We may add the denominators to the Poisson regression model as offsets. Again, these denominators could be the stratum size or the unit time of exposure. Multiple Poisson regression for rate is specified by adding the offset in the form of the natural log of the denominator \(t\). This is given as,
\[ln(\hat y) = ln(t) + b_0 + b_1x_1 + b_2x_2 + ... + b_px_p\]
10.5.2 Dataset
The data on the number of lung cancer cases among doctors, cigarettes per day, years of smoking and the respective person-years at risk of lung cancer are given in smoke.csv. The person-years variable serves as the offset for our analysis. The original data came from Doll (1971) and were analyzed in the context of Poisson regression by Frome (1983) and Fleiss, Levin, and Paik (2003). The dataset contains four variables:
- smoke_yrs: Years of smoking (categorical) {15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59}.
- cigar_day: Cigarettes per day (numerical).
- person_yrs: Person-years at risk of lung cancer (numerical).
- case: Number of lung cancer cases (count).
The dataset is loaded as follows,
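A sketch of this step, assuming smoke.csv is in the working directory; the object name smoke matches the code used later:
smoke = read.csv("smoke.csv")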
Then, we look at its data structure,
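For example, with str():
str(smoke)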
## 'data.frame': 63 obs. of 4 variables:
## $ smoke_yrs : chr "15-19" "20-24" "25-29" "30-34" ...
## $ cigar_day : num 0 0 0 0 0 0 0 0 0 5.2 ...
## $ person_yrs: int 10366 8162 5969 4496 3512 2201 1421 1121 826 3121 ...
## $ case : int 1 0 0 0 0 0 0 0 2 0 ...
10.5.3 Data exploration
For descriptive statistics, we use epiDisplay::codebook() as before.
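A sketch of the call:
epiDisplay::codebook(smoke)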
##
##
##
## smoke_yrs :
## A character vector
##
## ==================
## cigar_day :
## obs. mean median s.d. min. max.
## 63 17.271 15.9 12.913 0 40.8
##
## ==================
## person_yrs :
## obs. mean median s.d. min. max.
## 63 2426.444 1826 2030.143 104 10366
##
## ==================
## case :
## obs. mean median s.d. min. max.
## 63 2.698 1 3.329 0 12
##
## ==================
In addition, we are also interested to look at the observed rates,
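A sketch of this step, adding a rate column (cases per person-year, rounded to four decimal places) consistent with the output below:
smoke %>% mutate(rate = round(case / person_yrs, 4))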
## smoke_yrs cigar_day person_yrs case rate
## 1 15-19 0.0 10366 1 0.0001
## 2 20-24 0.0 8162 0 0.0000
## 3 25-29 0.0 5969 0 0.0000
## 4 30-34 0.0 4496 0 0.0000
## 5 35-39 0.0 3512 0 0.0000
## 6 40-44 0.0 2201 0 0.0000
## 7 45-49 0.0 1421 0 0.0000
## 8 50-54 0.0 1121 0 0.0000
## 9 55-59 0.0 826 2 0.0024
## 10 15-19 5.2 3121 0 0.0000
## 11 20-24 5.2 2937 0 0.0000
## 12 25-29 5.2 2288 0 0.0000
## 13 30-34 5.2 2015 0 0.0000
## 14 35-39 5.2 1648 1 0.0006
## 15 40-44 5.2 1310 2 0.0015
## 16 45-49 5.2 927 0 0.0000
## 17 50-54 5.2 710 3 0.0042
## 18 55-59 5.2 606 0 0.0000
## 19 15-19 11.2 3577 0 0.0000
## 20 20-24 11.2 3286 1 0.0003
## 21 25-29 11.2 2546 1 0.0004
## 22 30-34 11.2 2219 2 0.0009
## 23 35-39 11.2 1826 0 0.0000
## 24 40-44 11.2 1386 1 0.0007
## 25 45-49 11.2 988 2 0.0020
## 26 50-54 11.2 684 4 0.0058
## 27 55-59 11.2 449 3 0.0067
## 28 15-19 15.9 4317 0 0.0000
## 29 20-24 15.9 4214 0 0.0000
## 30 25-29 15.9 3185 0 0.0000
## 31 30-34 15.9 2560 4 0.0016
## 32 35-39 15.9 1893 0 0.0000
## 33 40-44 15.9 1334 2 0.0015
## 34 45-49 15.9 849 2 0.0024
## 35 50-54 15.9 470 2 0.0043
## 36 55-59 15.9 280 5 0.0179
## 37 15-19 20.4 5683 0 0.0000
## 38 20-24 20.4 6385 1 0.0002
## 39 25-29 20.4 5483 1 0.0002
## 40 30-34 20.4 4687 6 0.0013
## 41 35-39 20.4 3646 5 0.0014
## 42 40-44 20.4 2411 12 0.0050
## 43 45-49 20.4 1567 9 0.0057
## 44 50-54 20.4 857 7 0.0082
## 45 55-59 20.4 416 7 0.0168
## 46 15-19 27.4 3042 0 0.0000
## 47 20-24 27.4 4050 1 0.0002
## 48 25-29 27.4 4290 4 0.0009
## 49 30-34 27.4 4268 9 0.0021
## 50 35-39 27.4 3529 9 0.0026
## 51 40-44 27.4 2424 11 0.0045
## 52 45-49 27.4 1409 10 0.0071
## 53 50-54 27.4 663 5 0.0075
## 54 55-59 27.4 284 3 0.0106
## 55 15-19 40.8 670 0 0.0000
## 56 20-24 40.8 1166 0 0.0000
## 57 25-29 40.8 1482 0 0.0000
## 58 30-34 40.8 1580 4 0.0025
## 59 35-39 40.8 1336 6 0.0045
## 60 40-44 40.8 924 10 0.0108
## 61 45-49 40.8 556 7 0.0126
## 62 50-54 40.8 255 4 0.0157
## 63 55-59 40.8 104 1 0.0096
10.5.4 Univariable analysis
For the univariable analysis, we fit univariable Poisson regression models for cigarettes per day (cigar_day) and years of smoking (smoke_yrs). The offset or denominator is included as offset = log(person_yrs) in the glm option.
# cigar_day
pois_case1 = glm(case ~ cigar_day, data = smoke, family = "poisson",
offset = log(person_yrs))
summary(pois_case1)
##
## Call:
## glm(formula = case ~ cigar_day, family = "poisson", data = smoke,
## offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.159268 0.175063 -46.61 <2e-16 ***
## cigar_day 0.070359 0.006468 10.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.10 on 62 degrees of freedom
## Residual deviance: 324.71 on 61 degrees of freedom
## AIC: 448.55
##
## Number of Fisher Scoring iterations: 6
# smoke_yrs
pois_case2 = glm(case ~ smoke_yrs, data = smoke, family = "poisson",
offset = log(person_yrs))
summary(pois_case2)
##
## Call:
## glm(formula = case ~ smoke_yrs, family = "poisson", data = smoke,
## offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.335 1.000 -10.335 < 2e-16 ***
## smoke_yrs20-24 1.117 1.155 0.968 0.333149
## smoke_yrs25-29 1.990 1.080 1.842 0.065423 .
## smoke_yrs30-34 3.563 1.020 3.493 0.000477 ***
## smoke_yrs35-39 3.615 1.024 3.532 0.000412 ***
## smoke_yrs40-44 4.580 1.013 4.521 6.15e-06 ***
## smoke_yrs45-49 4.785 1.016 4.707 2.52e-06 ***
## smoke_yrs50-54 5.085 1.020 4.987 6.14e-07 ***
## smoke_yrs55-59 5.384 1.024 5.261 1.44e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.10 on 62 degrees of freedom
## Residual deviance: 170.17 on 54 degrees of freedom
## AIC: 308.01
##
## Number of Fisher Scoring iterations: 5
From the outputs, both variables are important, with P-values < .25 for most of the coefficients, including the dummy variables.
10.5.5 Multivariable analysis
For the multivariable analysis, we included cigar_day and smoke_yrs as predictors of case.
pois_case = glm(case ~ cigar_day + smoke_yrs, data = smoke,
family = "poisson", offset = log(person_yrs))
summary(pois_case)
##
## Call:
## glm(formula = case ~ cigar_day + smoke_yrs, family = "poisson",
## data = smoke, offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.317162 1.007492 -11.233 < 2e-16 ***
## cigar_day 0.064361 0.006401 10.054 < 2e-16 ***
## smoke_yrs20-24 0.962417 1.154694 0.833 0.40457
## smoke_yrs25-29 1.710894 1.080420 1.584 0.11330
## smoke_yrs30-34 3.207676 1.020378 3.144 0.00167 **
## smoke_yrs35-39 3.242776 1.024187 3.166 0.00154 **
## smoke_yrs40-44 4.208361 1.013726 4.151 3.30e-05 ***
## smoke_yrs45-49 4.448972 1.017054 4.374 1.22e-05 ***
## smoke_yrs50-54 4.893674 1.019945 4.798 1.60e-06 ***
## smoke_yrs55-59 5.372600 1.023404 5.250 1.52e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.099 on 62 degrees of freedom
## Residual deviance: 68.015 on 53 degrees of freedom
## AIC: 207.86
##
## Number of Fisher Scoring iterations: 5
From the output, both variables are significant predictors of the rate of lung cancer cases, although we noted that the P-values are not significant for the smoke_yrs20-24 and smoke_yrs25-29 dummy variables. This model serves as our preliminary model.
10.5.6 Interaction
Now, we include a two-way interaction term between cigar_day and smoke_yrs. From the output, although we noted that the interaction terms are not significant, the standard errors for cigar_day and the interaction terms are extremely large. This might point to a numerical issue with the model (Hosmer, Lemeshow, and Sturdivant 2013). So, we may drop the interaction term from our model.
pois_casex = glm(case ~ cigar_day * smoke_yrs, data = smoke,
family = "poisson", offset = log(person_yrs))
## Warning: glm.fit: fitted rates numerically 0 occurred
##
## Call:
## glm(formula = case ~ cigar_day * smoke_yrs, family = "poisson",
## data = smoke, offset = log(person_yrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.2463 1.0000 -9.246 < 2e-16 ***
## cigar_day -2.8137 319.6351 -0.009 0.992976
## smoke_yrs20-24 -0.7514 1.5114 -0.497 0.619084
## smoke_yrs25-29 -0.2539 1.3518 -0.188 0.851030
## smoke_yrs30-34 1.2493 1.1032 1.132 0.257489
## smoke_yrs35-39 0.5856 1.1660 0.502 0.615498
## smoke_yrs40-44 1.9706 1.0790 1.826 0.067800 .
## smoke_yrs45-49 2.1070 1.0982 1.919 0.055044 .
## smoke_yrs50-54 3.0710 1.0761 2.854 0.004318 **
## smoke_yrs55-59 3.5704 1.0706 3.335 0.000853 ***
## cigar_day:smoke_yrs20-24 2.8609 319.6351 0.009 0.992859
## cigar_day:smoke_yrs25-29 2.8736 319.6351 0.009 0.992827
## cigar_day:smoke_yrs30-34 2.8736 319.6351 0.009 0.992827
## cigar_day:smoke_yrs35-39 2.8997 319.6351 0.009 0.992762
## cigar_day:smoke_yrs40-44 2.8845 319.6351 0.009 0.992800
## cigar_day:smoke_yrs45-49 2.8886 319.6351 0.009 0.992790
## cigar_day:smoke_yrs50-54 2.8669 319.6351 0.009 0.992844
## cigar_day:smoke_yrs55-59 2.8641 319.6351 0.009 0.992851
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 445.099 on 62 degrees of freedom
## Residual deviance: 60.597 on 45 degrees of freedom
## AIC: 216.44
##
## Number of Fisher Scoring iterations: 18
10.5.7 Model fit assessment
Again, we assess the model fit by the chi-square goodness-of-fit test, model-to-model AIC comparison, the scaled Pearson chi-square statistic and standardized residuals.
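A sketch of the goodness-of-fit call on the preliminary model:
epiDisplay::poisgof(pois_case)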
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 68.01514
##
## $df
## [1] 53
##
## $p.value
## [1] 0.0802866
The P-value of the chi-square goodness-of-fit test is more than 0.05, which indicates that the model has a good fit.
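A sketch of the AIC comparison across the models fitted so far:
AIC(pois_case1, pois_case2, pois_case)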
## df AIC
## pois_case1 2 448.5549
## pois_case2 9 308.0129
## pois_case 10 207.8574
The comparison by AIC clearly shows that the multivariable model pois_case is the best model, as it has the lowest AIC value.
# Scaled Pearson chi-square statistic using quasipoisson
qpois_case_summ = summary(glm(case ~ cigar_day + smoke_yrs, data = smoke,
family = "quasipoisson",
offset = log(person_yrs)))
qpois_case_summ$dispersion
## [1] 1.072516
The value of the dispersion, i.e. the scaled Pearson chi-square statistic, is close to 1. This again indicates that the model has a good fit.
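As before, we may also obtain the standardized residuals; a sketch, with std_res as used in the code below:
std_res = rstandard(pois_case)
std_res[abs(std_res) > 1.96]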
## 6 8 18
## -1.995530 -2.025055 -2.253832
index = names(std_res[abs(std_res) > 1.96])
# points of discrepancies
cbind(smoke[index,], case_pred = pois_case$fitted[index]) %>%
mutate(rate = round(case/person_yrs, 4),
rate_pred = round(case_pred/person_yrs, 4))
## smoke_yrs cigar_day person_yrs case case_pred rate rate_pred
## 6 40-44 0.0 2201 0 1.800143 0 0.0008
## 8 50-54 0.0 1121 0 1.819366 0 0.0016
## 18 55-59 5.2 606 0 2.218868 0 0.0037
Lastly, we note that only a few observations (numbers 6, 8 and 18) have discrepancies between the observed and predicted cases. This shows how well the fitted Poisson regression model for rate explains the data at hand.
10.5.8 Presentation and interpretation
Now, we decide on the final model and use tbl_regression() to come up with a table for the results. Again, for interpretation, we exponentiate the coefficients to obtain the incidence rate ratio, IRR.
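A sketch of these steps; the object name pois_case_final is only illustrative:
pois_case_final = pois_case
tbl_regression(pois_case_final, exponentiate = TRUE)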
| Characteristic | IRR¹ | 95% CI¹ | p-value |
|---|---|---|---|
| cigar_day | 1.07 | 1.05, 1.08 | <0.001 |
| smoke_yrs | | | |
| 15-19 | — | — | |
| 20-24 | 2.62 | 0.34, 52.9 | 0.4 |
| 25-29 | 5.53 | 0.94, 105 | 0.11 |
| 30-34 | 24.7 | 5.23, 442 | 0.002 |
| 35-39 | 25.6 | 5.35, 459 | 0.002 |
| 40-44 | 67.2 | 14.6, 1,195 | <0.001 |
| 45-49 | 85.5 | 18.3, 1,525 | <0.001 |
| 50-54 | 133 | 28.3, 2,385 | <0.001 |
| 55-59 | 215 | 45.1, 3,862 | <0.001 |

¹ IRR = Incidence Rate Ratio, CI = Confidence Interval
From this table, we interpret the IRR values as follows:
- Taking an additional cigarette per day increases the risk of having lung cancer by 1.07 (95% CI: 1.05, 1.08), while controlling for the other variables.
- Those who had been smoking for between 30 to 34 years are at higher risk of having lung cancer with an IRR of 24.7 (95% CI: 5.23, 442), while controlling for the other variables.
We leave the rest of the IRRs for you to interpret. But take note that the IRRs for the years of smoking (smoke_yrs) categories from 30-34 to 55-59 are quite large with wide 95% CIs, although this does not seem to be a problem since the standard errors for the estimated coefficients are reasonable (look again at summary(pois_case)). The 95% CIs for 20-24 and 25-29 include 1 (which means no effect on risk), spanning from lower risk (IRR < 1) to higher risk (IRR > 1). This is expected because the P-values for these two categories are not significant.
Then, we view and save the output in spreadsheet format for later use. We use tidy(),
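A sketch of the call producing the output below:
tidy(pois_case, exponentiate = TRUE, conf.int = TRUE)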
## # A tibble: 10 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0000122 1.01 -11.2 2.81e-29 6.89e-7 5.49e-5
## 2 cigar_day 1.07 0.00640 10.1 8.79e-24 1.05e+0 1.08e+0
## 3 smoke_yrs20-24 2.62 1.15 0.833 4.05e- 1 3.35e-1 5.29e+1
## 4 smoke_yrs25-29 5.53 1.08 1.58 1.13e- 1 9.44e-1 1.05e+2
## 5 smoke_yrs30-34 24.7 1.02 3.14 1.67e- 3 5.23e+0 4.42e+2
## 6 smoke_yrs35-39 25.6 1.02 3.17 1.54e- 3 5.35e+0 4.59e+2
## 7 smoke_yrs40-44 67.2 1.01 4.15 3.30e- 5 1.46e+1 1.19e+3
## 8 smoke_yrs45-49 85.5 1.02 4.37 1.22e- 5 1.83e+1 1.52e+3
## 9 smoke_yrs50-54 133. 1.02 4.80 1.60e- 6 2.83e+1 2.38e+3
## 10 smoke_yrs55-59 215. 1.02 5.25 1.52e- 7 4.51e+1 3.86e+3
and export it to a .csv file,
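for example as sketched here, where the object and file names are only illustrative:
tib_pois_case = tidy(pois_case, exponentiate = TRUE, conf.int = TRUE)
write.csv(tib_pois_case, "pois_case.csv")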
Now, we present the model equation, which unfortunately is quite a lengthy one this time. We display the coefficients,
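A sketch of the call that produces the rounded coefficient table below:
round(summary(pois_case)$coefficients, 2)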
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.32 1.01 -11.23 0.00
## cigar_day 0.06 0.01 10.05 0.00
## smoke_yrs20-24 0.96 1.15 0.83 0.40
## smoke_yrs25-29 1.71 1.08 1.58 0.11
## smoke_yrs30-34 3.21 1.02 3.14 0.00
## smoke_yrs35-39 3.24 1.02 3.17 0.00
## smoke_yrs40-44 4.21 1.01 4.15 0.00
## smoke_yrs45-49 4.45 1.02 4.37 0.00
## smoke_yrs50-54 4.89 1.02 4.80 0.00
## smoke_yrs55-59 5.37 1.02 5.25 0.00
and put the values in the equation. Remember to include the offset in the equation.
\[\begin{aligned} ln(case) = &\ ln(person\_yrs) -11.32 + 0.06\times cigar\_day \\ & + 0.96\times smoke\_yrs(20-24) + 1.71\times smoke\_yrs(25-29) \\ & + 3.21\times smoke\_yrs(30-34) + 3.24\times smoke\_yrs(35-39) \\ & + 4.21\times smoke\_yrs(40-44) + 4.45\times smoke\_yrs(45-49) \\ & + 4.89\times smoke\_yrs(50-54) + 5.37\times smoke\_yrs(55-59) \end{aligned}\]
10.6 Quasi-Poisson Regression for Overdispersed Data
We utilized the family = "quasipoisson" option in the glm specification earlier just to easily obtain the scaled Pearson chi-square statistic without knowing what it is. So, what is quasi-Poisson regression? For a typical Poisson regression analysis, we rely on the maximum likelihood estimation method. It assumes that the mean (of the count) and its variance are equal, or that the variance divided by the mean equals 1. For that reason, we expect the scaled Pearson chi-square statistic to be close to 1, indicating a good fit of the Poisson regression model.
So what if this assumption of "mean equals variance" is violated? Whenever the variance is larger than the mean, we call this issue overdispersion. Overdispersion affects a Poisson regression model by causing the standard errors of the coefficients to be underestimated. As a result, we may obtain narrower confidence intervals and smaller P-values (i.e. more likely to have false positive results) than we should. To handle the overdispersion issue, one may use negative binomial regression, which we do not cover in this book. A more flexible option is quasi-Poisson regression, which relies on the quasi-likelihood estimation method (Fleiss, Levin, and Paik 2003).
Demonstrating quasi-Poisson regression is not difficult because we already did it when we obtained the scaled Pearson chi-square statistic in the previous sections. As an example, we repeat the same here using the model for count. We fit the standard Poisson regression model,
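a sketch of which is given below; the object name pois_attack_all is only illustrative:
pois_attack_all = glm(attack ~ gender + res_inf + ghq12, data = asthma, family = "poisson")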
Then we fit the same model using quasi-Poisson regression,
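changing only the family argument, as sketched here with an illustrative object name:
qpois_attack_all = glm(attack ~ gender + res_inf + ghq12, data = asthma, family = "quasipoisson")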
Now, pay attention to the standard errors and confidence intervals of each model. Can you spot the differences between the two? (Hints: look at the std.error, p.value, conf.low and conf.high columns.)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.315 0.183 -1.72 8.57e- 2 -0.685 0.0347
## 2 gendermale -0.0419 0.122 -0.342 7.32e- 1 -0.285 0.196
## 3 res_infyes 0.426 0.153 2.79 5.28e- 3 0.133 0.733
## 4 ghq12 0.0495 0.00788 6.28 3.29e-10 0.0342 0.0651
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.315 0.189 -1.67 0.0979 -0.697 0.0449
## 2 gendermale -0.0419 0.126 -0.332 0.740 -0.292 0.203
## 3 res_infyes 0.426 0.157 2.71 0.00779 0.124 0.742
## 4 ghq12 0.0495 0.00812 6.10 0.0000000143 0.0337 0.0656
Note that there are no changes to the coefficients between the standard Poisson regression and the quasi-Poisson regression. We also interpret the quasi-Poisson regression model output in the same way as the standard Poisson regression model output.
10.7 Summary
In this chapter, we went through the basics of Poisson regression for count and rate data. We performed the analysis for each and learned how to assess the model fit for the regression models. We learned how to nicely present and interpret the results. In addition, we also learned how to utilize the model for prediction. To understand more about the concepts, analysis workflow and interpretation of count data analysis, including Poisson regression, we recommend the books Epidemiology: Study Design and Data Analysis (Woodward 2013) and Regression Models for Categorical Dependent Variables Using Stata (Long, Freese, and LP. 2006).