Chapter 10 Solutions to Selected Exercises

10.1 Chapter 1

Question 5

  1. \(y_i =\) weight of the \(i\)-th person, \(x_{i1} =\) their age, \(x_{i2} =\) sex, \(x_{i3} =\) height, \(x_{i4} =\) mean daily food intake, \(x_{i5} =\) mean daily energy expenditure. Distribution of \(y_i\) could be Gaussian. Linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}\).

  2. \(y_i =\) proportion of infected mice out of the 20 mice in experiment \(i\), \(x_{ij} =\) indicator variable for exposure at level \(j\), with \(j = 1, 2, 3, 4\). Distribution of \(y_i\) could be a scaled binomial, i.e. \(y_i = z_i/20\) with \(z_i \sim \text{Binomial}(20, \pi_i)\). Linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}\).

  3. \(y_i =\) number of trips to the supermarket per week, \(x_{i1} =\) number of people in household, \(x_{i2} =\) household income, \(x_{i3} =\) distance to the supermarket. Distribution of \(y_i\) could be Poisson. Linear component is \(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}\).
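
For illustration, these three models could be fitted in R roughly as follows. This is a minimal sketch assuming hypothetical data frames (people, mice, shopping) and illustrative variable names; the link choices are the defaults for each family.

fit_weight <- glm(weight ~ age + sex + height + food + energy,
                  family = gaussian, data = people)
fit_mice   <- glm(prop_infected ~ factor(exposure),        # proportion response
                  weights = rep(20, nrow(mice)),           # 20 mice per experiment
                  family = binomial, data = mice)
fit_trips  <- glm(trips ~ household_size + income + distance,
                  family = poisson(link = "log"), data = shopping)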

Question 6

The grouped data, with \(n = 7\) groups:

| Number of trips | Size | Council tax band | Car | \(m_i\) | \(y_i\) | \(\phi_i\) |
|-----------------|------|------------------|-----|---------|---------|------------|
| {3, 2}          | 1    | C                | 1   | 2       | 2.5     | 1/2        |
| {1, 2}          | 2    | B                | 0   | 2       | 1.5     | 1/2        |
| {1, 1, 2}       | 1    | A                | 0   | 3       | 1.33    | 1/3        |
| {3, 2}          | 4    | B                | 0   | 2       | 2.5     | 1/2        |
| {4}             | 3    | C                | 1   | 1       | 4       | 1          |
| {2}             | 3    | C                | 0   | 1       | 2       | 1          |
| {3}             | 4    | C                | 1   | 1       | 3       | 1          |
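
A minimal sketch of how this grouping could be reproduced in R, assuming a hypothetical ungrouped data frame journeys with one row per person and columns n_trips, size, band, and car:

library(dplyr)
grouped <- journeys |>
  group_by(size, band, car) |>
  summarise(m   = n(),            # group size m_i
            y   = mean(n_trips),  # group mean y_i
            phi = 1 / n(),        # weight phi_i = 1/m_i
            .groups = "drop")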

10.2 Chapter 2

Question 1b

Question 2

  1. This link function tries to capture \(E[y_i] = \mu_i = (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2\).
  2. \(\displaystyle S(\boldsymbol{\beta}) = \frac{2}{\sigma^2} \sum_i \left( y_i - (\boldsymbol{\beta}^T \boldsymbol{x}_i)^2 \right) (\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i\).
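
A minimal sketch of this score function in R (the function name and arguments are illustrative: X is the \(n \times p\) design matrix, y the response vector, and sigma2 the dispersion \(\sigma^2\)):

score_sq_link <- function(beta, X, y, sigma2 = 1) {
  eta <- as.vector(X %*% beta)   # linear predictors beta^T x_i
  mu  <- eta^2                   # means under mu_i = eta_i^2
  (2 / sigma2) * colSums((y - mu) * eta * X)
}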

Question 3

  1. First we identify the EDF components. An exponential distribution is an EDF with \(\phi = 1\), \(\theta = -\lambda\), and \(b(\theta) = -\ln(-\theta)\). This gives \(\mu = -1/\theta = 1/\lambda\) and \(\text{Var}[y] = 1/\theta^{2} = \mu^{2}\), as expected. Recall that \(\text{Var}[y] = \phi \,\mathcal{V}(\mu)\), which defines the variance function \(\mathcal{V}\). The response function \(h\) is given by \(\mu = h(\eta) = e^{\eta}\), so that \(h'(\eta) = e^{\eta}\) and \(\mathcal{V}(\mu) = \mu^{2} = e^{2\eta}\). Substituting all of this into the standard formula \(S(\boldsymbol{\beta}) = \sum_i \frac{y_i - \mu_i}{\phi\, \mathcal{V}(\mu_i)}\, h'(\eta_i)\, \boldsymbol{x}_i\) gives

\[ S(\boldsymbol{\beta}) = \sum_i (y_i - e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}) \;e^{-2 \boldsymbol{\beta}^T \boldsymbol{x}_i} \;e^{\boldsymbol{\beta}^T \boldsymbol{x}_i} \boldsymbol{x}_i = \sum_i \Biggl( {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} - 1 \Biggr) \boldsymbol{x}_i. \]

  2. Differentiating gives

\[ F_{\text{obs}}(\boldsymbol{\beta}) = -{\partial S\over \partial\boldsymbol{\beta}} = \sum_i {y_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_{i} \boldsymbol{x}_{i}^{T}. \]

  3. Taking expectations affects only the \(y_i\) in each term, replacing it by \(\mu_i = e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}\), so

\[ F(\boldsymbol{\beta}) = \sum_i {\mu_i \over e^{\boldsymbol{\beta}^T \boldsymbol{x}_i}} \;\boldsymbol{x}_i \boldsymbol{x}_i^T = \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T. \]

Notice that this is not even a function of \(\boldsymbol{\beta}\).
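
Since \(F(\boldsymbol{\beta}) = \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T\) is constant, Fisher scoring for this model is particularly simple. A minimal sketch in R, with illustrative names (X is the design matrix, y the exponential responses):

fisher_scoring_exp <- function(X, y, beta = rep(0, ncol(X)), n_iter = 25) {
  Fmat <- crossprod(X)                         # F = sum_i x_i x_i^T, free of beta
  for (k in seq_len(n_iter)) {
    eta  <- as.vector(X %*% beta)
    S    <- colSums((y / exp(eta) - 1) * X)    # score function from above
    beta <- beta + solve(Fmat, S)              # Fisher scoring update
  }
  beta
}
# Because the exponential is a Gamma distribution with shape 1, and Gamma GLM
# coefficient estimates do not depend on the dispersion, this should agree with
# coef(glm(y ~ X - 1, family = Gamma(link = "log"))).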

Question 5

  1. We have \(\eta_i = g(\mu_i) = \log(\mu_i) = \log(n_i \lambda_i) = \log n_i + \boldsymbol{\beta}^T \boldsymbol{x}_i\). We can also think of this model as an ordinary Poisson GLM with \(\eta_i = \tilde{\boldsymbol{\beta}}^T \tilde{\boldsymbol{x}}_i\), where \(\tilde{\boldsymbol{\beta}} = \begin{pmatrix} 1 \\ \boldsymbol{\beta} \end{pmatrix}\) and \(\tilde{\boldsymbol{x}}_i = \begin{pmatrix} \log n_i \\ \boldsymbol{x}_i \end{pmatrix}\). See Section 3.5.3 in the textbook for an example of this model.

  2. This model is similar to an ordinary Poisson GLM but has an additional \(\log n_i\) term (an offset) in the linear predictor, which does not affect the general forms of the score function and Fisher information (you should verify this). Thus, we can use the expressions for the score function and Fisher information from the lecture notes. Assume the data are ungrouped (grouped data would also be unrealistic here, since we can rarely obtain several responses for the same population). Since log is the natural link for this model, we have:

\[ S(\boldsymbol{\beta}) = \sum_i (y_i - \mu_i) \boldsymbol{x}_i = \sum_i \left( y_i - n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \right) \boldsymbol{x}_i. \] Furthermore, for the natural link, the observed Fisher information and expected Fisher information are the same, with

\[ F_{\text{obs}}(\boldsymbol{\beta}) = F(\boldsymbol{\beta}) = \sum_i h'(\eta_i) \boldsymbol{x}_i\boldsymbol{x}_i^T = \sum_i n_i \exp(\boldsymbol{\beta}^T \boldsymbol{x}_i) \boldsymbol{x}_i\boldsymbol{x}_i^T. \]
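
In R, the \(\log n_i\) term is handled as an offset, i.e. a covariate whose coefficient is fixed at 1. A minimal sketch, assuming a hypothetical data frame df with columns y, n, x1, and x2:

fit <- glm(y ~ x1 + x2 + offset(log(n)),        # log(n_i) enters with coefficient 1
           family = poisson(link = "log"), data = df)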

Question 6

The score function and Fisher information in matrix notation are:

\[\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} &= \boldsymbol{X}^{T}\boldsymbol{D}^{T}\boldsymbol{\Sigma}^{-1}\boldsymbol{D}\boldsymbol{X} \end{align}\]

where

\[\begin{align} \boldsymbol{X} &= \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix} \\ \boldsymbol{Y} &= \begin{pmatrix} 12 \\ 10 \\ 8 \\ 7 \end{pmatrix} \\ \boldsymbol{\mu} &= \begin{pmatrix} e^{\beta_0} \\ e^{\beta_0 + \beta_1} \\ e^{\beta_0 + 2 \beta_1} \\ e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{D} &= \begin{pmatrix} e^{\beta_0} & 0 & 0 & 0 \\ 0 & e^{\beta_0 + \beta_1} & 0 & 0 \\ 0 & 0 & e^{\beta_0 + 2 \beta_1} & 0 \\ 0 & 0 & 0 & e^{\beta_0 + 3 \beta_1} \end{pmatrix} \\ \boldsymbol{\Sigma} &= \begin{pmatrix} \frac{e^{\beta_0}}{20} & 0 & 0 & 0 \\ 0 & \frac{e^{\beta_0 + \beta_1}}{20} & 0 & 0 \\ 0 & 0 & \frac{e^{\beta_0 + 2 \beta_1}}{10} & 0 \\ 0 & 0 & 0 & \frac{e^{\beta_0 + 3 \beta_1}}{10} \end{pmatrix}. \end{align}\]

Furthermore, since the log link is the natural link and \(\phi=1\) for the Poisson GLM, we have \(\boldsymbol{D} = \text{diag}(\mu_i)\) and \(\boldsymbol{\Sigma} = \text{diag}(\mu_i/n_i)\), so that \(\boldsymbol{D}\boldsymbol{\Sigma}^{-1} = \text{diag}(n_i) = \boldsymbol{G}\) and \(\boldsymbol{D}\boldsymbol{\Sigma}^{-1}\boldsymbol{D} = \boldsymbol{G}^{T}\boldsymbol{\Sigma}\boldsymbol{G} = \text{diag}(n_i \mu_i)\). The expressions therefore simplify to:

\[\begin{align} \boldsymbol{S} &= \boldsymbol{X}^{T} \boldsymbol{G} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ \boldsymbol{F} & = \boldsymbol{X}^{T}\boldsymbol{G}^{T}\boldsymbol{\Sigma}\boldsymbol{G}\boldsymbol{X} \end{align}\]

where \(\boldsymbol{G} = \text{diag}(n_i)\) is the diagonal matrix of group sizes: \[ \boldsymbol{G} = \begin{pmatrix} 20 & 0 & 0 & 0 \\ 0 & 20 & 0 & 0 \\ 0 & 0 & 10 & 0 \\ 0 & 0 & 0 & 10 \end{pmatrix}. \]
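
These quantities can be evaluated numerically for any value of \(\boldsymbol{\beta}\). A minimal sketch in R using the data above (the trial value of beta at the end is purely illustrative):

X <- cbind(1, 0:3)
Y <- c(12, 10, 8, 7)
n <- c(20, 20, 10, 10)                       # group sizes, i.e. G = diag(n)
score_fisher <- function(beta) {
  mu   <- exp(as.vector(X %*% beta))         # mu_i = exp(beta_0 + beta_1 * x_i)
  S    <- t(X) %*% (n * (Y - mu))            # S = X^T G (Y - mu)
  Fmat <- t(X) %*% (n * mu * X)              # F = X^T diag(n_i mu_i) X
  list(S = S, F = Fmat)
}
score_fisher(c(log(12), -0.2))               # evaluate at an arbitrary trial beta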

Question 9

  1. From \(\boldsymbol{S} = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{Y} - \boldsymbol{\mu})\), we have:

\[\begin{align} E[\boldsymbol{S}] = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (E[\boldsymbol{Y}] - \boldsymbol{\mu}) = \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu} - \boldsymbol{\mu}) = 0 \end{align}\]

and

\[\begin{align} \text{Var}[\boldsymbol{S}] &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y} - \boldsymbol{\mu}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \text{Var}(\boldsymbol{Y}) \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{\Sigma} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}\]

  2. Taking the derivative of \(\boldsymbol{S}\) and applying the product rule, we have:

\[\begin{align} \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial }{\partial \boldsymbol{\beta}} (\boldsymbol{Y} - \boldsymbol{\mu}) \\ &= \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) (\boldsymbol{Y} - \boldsymbol{\mu}) - \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}}. \end{align}\]

Thus,

\[\begin{align} \boldsymbol{F} = E \left[- \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{\beta}} \right] &= - E \left[ \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}(\boldsymbol{D} \boldsymbol{\Sigma}^{-1}) ( \boldsymbol{Y} - \boldsymbol{\mu} ) \right] + E \left[ \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \right] \\ &= - \boldsymbol{X}^{T} \frac{\partial }{\partial \boldsymbol{\beta}}( \boldsymbol{D} \boldsymbol{\Sigma}^{-1}) E [ \boldsymbol{Y} - \boldsymbol{\mu} ] + \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \frac{\partial \boldsymbol{\mu}}{\partial \boldsymbol{\eta}} \frac{\partial \boldsymbol{\eta}}{\partial \boldsymbol{\beta}} \\ &= \boldsymbol{X}^{T} \boldsymbol{D} \boldsymbol{\Sigma}^{-1} \boldsymbol{D} \boldsymbol{X}. \end{align}\]

10.3 Chapter 3

Question 2a

Let \(x_{i1} = {\tt time}_i\), \(x_{i2} = {\tt I(cos(2*pi*time/12))}_i\), and \(x_{i3} = {\tt I(sin(2*pi*time/12))}_i\), where the subscript \(i\) indicates the \(i\)-th data point. We perform the test with

\[ \begin{aligned} & \mathcal{H}_0: \eta_i = \beta_0 + \beta_1 x_{i1} \\ \text{against } \quad & \mathcal{H}_1: \eta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}. \end{aligned} \] Let \(\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \beta_3)^T\), then

\[ \begin{aligned} C & = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \\ \gamma & = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{aligned} \]

The covariance matrix that we are looking for is:

\[ C \, F(\hat{\boldsymbol\beta})^{-1} \, C^T = C \, \text{Var} [ \hat{\boldsymbol\beta} ] \, C^T = \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right]. \]

Thus, the Wald statistic is:

\[ \begin{aligned} W &= (C \hat{\boldsymbol\beta} - \gamma)^T \left( C F(\hat{\boldsymbol\beta})^{-1} C^T \right)^{-1} (C \hat{\boldsymbol\beta} - \gamma) \\ &= (\hat{\beta_2} \;\; \hat{\beta_3}) \, \text{Var} \left[ \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix} \right]^{-1} \begin{pmatrix} \hat{\beta_2} \\ \hat{\beta_3} \end{pmatrix}. \end{aligned} \]

From R, we can get \((\hat{\beta_2} \;\; \hat{\beta_3})\) and the corresponding covariance matrix by:

( c_betahat <- polio1.glm$coef[3:4] )
## I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12)) 
##                0.1812537               -0.4231874
( c_varhat <- summary(polio1.glm)$cov.scaled[3:4, 3:4] )
##                          I(cos(2 * pi * time/12)) I(sin(2 * pi * time/12))
## I(cos(2 * pi * time/12))             0.0092468054            -0.0002083338
## I(sin(2 * pi * time/12))            -0.0002083338             0.0095238337

Thus, the Wald statistic can be computed by:

( w <- c_betahat %*% solve(c_varhat) %*% c_betahat )
##          [,1]
## [1,] 22.00497

Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed, with p-value:

1 - pchisq(w, 2)
##              [,1]
## [1,] 1.666024e-05

Since this p-value is very small, we reject \(\mathcal{H}_0\) at both 5% and 1% significance levels.

Note that we can check our result by using the wald.test function from the mdscore package (to be installed if necessary), e.g.,

library(mdscore)
## Warning: package 'mdscore' was built under R version 4.3.3
wald.test(polio1.glm, 3:4)
## $W
## [1] 22.00497
## 
## $pvalue
## [1] 1.666024e-05
## 
## attr(,"class")
## [1] "wald.test"

Question 4

  1. Let there be \(n\) matches and \(p\) teams. We can construct a match matrix \(M \in \mathbb{R}^{n \times p}\) where each column represents a team and each row a match by assigning:

\[ x_{ij} = \begin{cases} +1 & \mbox{if team } j \mbox{ played at home in match } i \\ -1 & \mbox{if team } j \mbox{ played away in match } i \\ 0 & \mbox{if team } j \mbox{ did not play in match } i \end{cases} \]

However, notice that:

  • \(x_{i1} = - \sum_{j=2}^p x_{ij}\) for every match \(i\), so the columns of \(M\) are linearly dependent. Hence \(M\) cannot serve as a design matrix (it is rank-deficient, making \(M^T M\) singular); in particular, the team strengths are not identifiable.
  • Further, this matrix does not provide for an intercept.

Hence, we adjust the above matrix into the design matrix \(X \in \mathbb{R}^{n \times p}\), whose first column is \((1,\ldots, 1)^T\) and whose remaining columns are any \(p-1\) of the columns of \(M\). In the results table, the strength for Arsenal is omitted, which means that the column for Arsenal is not included in \(X\). In this setting, Arsenal is defined to have strength zero, and all other strengths are estimated relative to it.

The intercept term represents the home-team advantage, as it is a constant additive effect on the home team's strength.
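
A minimal sketch of constructing this design matrix in R, assuming a hypothetical data frame matches with character columns home and away giving the team names:

teams <- sort(unique(c(matches$home, matches$away)))
M <- matrix(0, nrow = nrow(matches), ncol = length(teams),
            dimnames = list(NULL, teams))
for (i in seq_len(nrow(matches))) {
  M[i, matches$home[i]] <- +1   # home team in match i
  M[i, matches$away[i]] <- -1   # away team in match i
}
X <- cbind(intercept = 1, M[, teams != "Arsenal"])   # Arsenal as the reference team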

  2. The simple model predicts odds on Chelsea winning of:

\[ e^{\hat{\beta}_0 + \hat{\beta}_{\text{Chelsea}} - \hat{\beta}_{\text{Newcastle}}} = e^{-0.284+(-0.498)-(-1.507)} = e^{0.725} = 2.06. \]

This is quite different from the bookmakers' odds.
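
As a quick arithmetic check of this calculation in R:

exp(-0.284 + (-0.498) - (-1.507))   # approximately 2.06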

10.4 Chapter 4

Question 2

We will only compare polio.glm and polio1.glm here (see the solution to Exercise 2a in Chapter 3 for the hypotheses being tested).

  1. First, we fit polio3.glm again but with an appropriate order for the variables and then get the analysis of deviance table:
# one covariate value per 12-month block (year), repeated for each month
temp_data <- rep(c(5.195, 5.138, 5.316, 5.242, 5.094, 5.108, 5.260, 5.153, 
                   5.155, 5.231, 5.234, 5.142, 5.173, 5.167), each = 12)
# rescale to the interval [0, 10]
scaled_temp <- 10 * (temp_data - min(temp_data))/(max(temp_data) - min(temp_data))
uspolio$temp <- scaled_temp

# refit polio3.glm with the terms entered in an order suitable for the sequential anova
polio3.glm <- glm(cases~time + I(cos(2*pi*time/12)) + I(sin(2*pi*time/12)) + 
                  I(cos(2*pi*time/6)) + I(sin(2*pi*time/6)) + temp, 
                  family=poisson(link=log), data=uspolio)
anova(polio3.glm)
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: cases
## 
## Terms added sequentially (first to last)
## 
## 
##                          Df Deviance Resid. Df Resid. Dev
## NULL                                       167     343.00
## time                      1   9.4538       166     333.55
## I(cos(2 * pi * time/12))  1   3.4306       165     330.12
## I(sin(2 * pi * time/12))  1  19.3938       164     310.72
## I(cos(2 * pi * time/6))   1  21.3637       163     289.36
## I(sin(2 * pi * time/6))   1   0.5036       162     288.85
## temp                      1  12.0192       161     276.84

From the table, the test statistic for \(\mathcal{H}_0\): polio.glm and \(\mathcal{H}_1\): polio1.glm is:

\[ \frac{D({\tt polio.glm}, {\tt polio1.glm})}{\hat{\phi}} = \frac{333.55 - 310.72}{1} = 22.83. \]

Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed with the p-value:

1 - pchisq(22.83, 2)
## [1] 1.102881e-05

Since this p-value is very small, we reject \(\mathcal{H}_0\) at both 5% and 1% significance levels.

  2. We can check our result above using the following R code, which should give a very similar p-value.
anova(polio.glm, polio1.glm, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: cases ~ time
## Model 2: cases ~ time + I(cos(2 * pi * time/12)) + I(sin(2 * pi * time/12))
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       166     333.55                          
## 2       164     310.72  2   22.825 1.106e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  3. If we only have the summary of polio3.glm, we can use the null deviance and residual deviance reported there to test the null model against polio3.glm, as sketched below.
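
A minimal sketch of this test, using the deviances from the table above (null deviance 343.00 on 167 df, residual deviance 276.84 on 161 df, and \(\hat{\phi} = 1\) for the Poisson model):

1 - pchisq(343.00 - 276.84, df = 167 - 161)   # extremely small, so the null model is rejected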

Question 3

  1. From the table in Section 4.5.3, the test statistic for \(\mathcal{H}_0: \mathcal{M}_1\) and \(\mathcal{H}_1: \mathcal{M}_3\) is:

\[ \frac{D(\mathcal{M}_1, \mathcal{M}_3)}{\hat{\phi}} = \frac{8.172 - 5.785}{0.266} = 8.97. \]

Under \(\mathcal{H}_0\), this statistic is approximately \(\chi^{2}(2)\) distributed with the p-value:

1 - pchisq(8.97, 2)
## [1] 0.01127689

Thus, we reject \(\mathcal{H}_0\) at the 5% significance level, but not at the 1% level.

  2. The full model \(\mathcal{M}_7\) adds further parameters that do little to reduce the deviance while using up degrees of freedom; as a result, we fail to reject \(\mathcal{H}_0\).

  3. This is because R uses different dispersion estimates for the two anova calls. The first table uses the estimated dispersion of \(\mathcal{M}_7\), while the second table uses the estimated dispersion of \(\mathcal{M}_2\) (the bigger model).

# p-value recomputed with the dispersion estimate from m7 (as in the first table)
1 - pchisq((8.1722-6.7879)/summary(m7)$disp, 1)
## [1] 0.02258198
# p-value recomputed with the dispersion estimate from m2 (as in the second table)
1 - pchisq((8.1722-6.7879)/summary(m2)$disp, 1)
## [1] 0.04149395