Chapter 1 Introduction and Review

1.1 Introduction

In the first term, we learned about categorical data analysis and some basics of generalised linear models (GLMs). In this term, we will continue to explore GLMs in more detail and study some of their more general variants. In particular, we will learn:

  • How to estimate the parameters of a GLM from data.

  • How to make a prediction and do inference once a GLM has been fitted.

  • How to perform deviance analysis with GLMs.

  • How to handle overdispersion using quasi-likelihood methods.

  • How to model repeated measures data using marginal models.

  • How to model mixed effects using linear mixed models (LMMs) and generalised linear mixed models (GLMMs).

For the rest of this chapter, we will review some random vector and random matrix identities that will be useful later. Then we will review the basics of GLMs.

1.2 Random Vectors and Random Matrices: A Review

  • A random vector is a vector of random variables:

\[\boldsymbol{X} = \left(\begin{array}{c} X_1 \\ \vdots \\ X_n \end{array}\right).\]

  • The mean or expectation of \(\boldsymbol{X}\) is defined as:

\[E[\boldsymbol{X}] = \left(\begin{array}{c} E[X_1] \\ \vdots \\ E[X_n] \end{array}\right).\]

  • A random matrix is a matrix of random variables:

\[ \boldsymbol{Z} = (Z_{ij}) = \left(\begin{array}{ccc} Z_{11} & \ldots & Z_{1m} \\ \vdots & \ddots & \vdots \\ Z_{n1} & \ldots & Z_{nm} \end{array}\right). \]

  • The expectation of \(\boldsymbol{Z}\) is defined as:

\[ E[\boldsymbol{Z}] = (E[Z_{ij}]) = \left(\begin{array}{ccc} E[Z_{11}] & \ldots & E[Z_{1m}] \\ \vdots & \ddots & \vdots \\ E[Z_{n1}] & \ldots & E[Z_{nm}] \end{array}\right). \]

  • Some properties of random vectors and random matrices:

    • If \(\boldsymbol{a}\) is a constant (i.e., non-random) vector, \(E[\boldsymbol{a}] = \boldsymbol{a}\).

    • If \(\boldsymbol{A}\) is a constant matrix, \(E[\boldsymbol{A}] = \boldsymbol{A}\).

    • \(E[\boldsymbol{X} + \boldsymbol{Y}] = E[\boldsymbol{X}] + E[\boldsymbol{Y}]\) for any random matrices \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) of the same dimensions.

    • \(E[\boldsymbol{A}\boldsymbol{X}] = \boldsymbol{A} E[\boldsymbol{X}]\) for a constant matrix \(\boldsymbol{A}\) and a random matrix \(\boldsymbol{X}\).

    • More generally, \(E[\boldsymbol{A}\boldsymbol{X}\boldsymbol{B} + \boldsymbol{C}] = \boldsymbol{A}E[\boldsymbol{X}]\boldsymbol{B} + \boldsymbol{C}\) for constant matrices \(\boldsymbol{A}\), \(\boldsymbol{B}\) and \(\boldsymbol{C}\).
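As a quick numerical sanity check of the last identity, here is a minimal simulation sketch in R (the dimensions and distributions are arbitrary choices, purely for illustration):

```r
# Monte Carlo check of E[A X B + C] = A E[X] B + C for constant A, B, C.
set.seed(1)
A <- matrix(c(1, 2, 0, -1), 2, 2)          # constant 2 x 2 matrix
B <- matrix(c(0.5, 1, -1, 0, 2, 1), 2, 3)  # constant 2 x 3 matrix
C <- matrix(1, 2, 3)                       # constant 2 x 3 matrix

nsim <- 1e5
lhs <- matrix(0, 2, 3)   # running average of A X B + C
EX  <- matrix(0, 2, 2)   # running average of X
for (s in 1:nsim) {
  X   <- matrix(rnorm(4, mean = 1), 2, 2)  # one draw of a 2 x 2 random matrix
  lhs <- lhs + (A %*% X %*% B + C) / nsim
  EX  <- EX + X / nsim
}
lhs                    # Monte Carlo estimate of E[A X B + C]
A %*% EX %*% B + C     # should agree up to simulation error
```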

  • Let \(\boldsymbol{X}\) be a random vector. The covariance matrix of \(\boldsymbol{X}\) is defined as:

\[ cov(\boldsymbol{X}) = (cov(X_i, X_j)) = \left(\begin{array}{cccc} var(X_1) & cov(X_1, X_2) & \ldots & cov(X_1, X_n) \\ cov(X_2, X_1) & var(X_2) & \ldots & cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ cov(X_n, X_1) & cov(X_n, X_2) & \ldots & var(X_n) \end{array}\right). \]

  • We can also write:

\[\begin{aligned} cov(\boldsymbol{X}) &= E[(\boldsymbol{X} - E[\boldsymbol{X}])(\boldsymbol{X} - E[\boldsymbol{X}])^T] \\ &= E\left[ \left(\begin{array}{c} X_1 - E[X_1] \\ \vdots \\ X_n - E[X_n] \end{array}\right) \left( X_1 - E[X_1], \ldots, X_n - E[X_n] \right) \right]. \end{aligned}\]
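The second expression suggests a simple way to estimate \(cov(\boldsymbol{X})\) from simulated draws: average the centred outer products. A minimal R sketch (the three component distributions are arbitrary choices):

```r
# Estimate cov(X) by averaging the centred outer products (X - E[X])(X - E[X])^T.
set.seed(2)
n_draws <- 1e5
samples <- cbind(rnorm(n_draws),              # X1
                 rexp(n_draws, rate = 2),     # X2
                 rpois(n_draws, lambda = 4))  # X3; one draw of X per row
centred <- sweep(samples, 2, colMeans(samples))  # subtract the column means
crossprod(centred) / n_draws   # average of the outer products
cov(samples)                   # R's sample covariance (n - 1 denominator), nearly identical
```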

  • Some properties of covariance matrices:

    • They are symmetric: \(cov(\boldsymbol{X}) = cov(\boldsymbol{X})^T\).

    • If \(\boldsymbol{a}\) is a constant vector, \(cov(\boldsymbol{X} + \boldsymbol{a}) = cov(\boldsymbol{X})\).

    • If \(\boldsymbol{A}\) is a constant matrix, \(cov(\boldsymbol{A} \boldsymbol{X}) = \boldsymbol{A} cov(\boldsymbol{X}) \boldsymbol{A}^T\).

    • \(cov(\boldsymbol{X}) = E[\boldsymbol{X} \boldsymbol{X}^T] - E[\boldsymbol{X}] E[\boldsymbol{X}]^T\).
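The second and third properties can also be checked by simulation; a minimal R sketch (again with arbitrary choices of \(\boldsymbol{a}\), \(\boldsymbol{A}\) and the distribution of \(\boldsymbol{X}\)):

```r
# Check cov(X + a) = cov(X) and cov(A X) = A cov(X) A^T on simulated draws.
set.seed(3)
samples <- matrix(rnorm(3e5), ncol = 3) %*%
  matrix(c(1, 0.5, 0, 0, 1, 0.3, 0, 0, 1), 3, 3)  # correlated 3-dimensional draws, one per row
a <- c(10, -2, 5)                               # constant shift vector
A <- matrix(c(1, 1, 0, -1, 2, 1), nrow = 2)     # constant 2 x 3 matrix

max(abs(cov(sweep(samples, 2, a, "+")) - cov(samples)))  # ~0: shifting leaves cov unchanged
cov(samples %*% t(A))          # sample covariance of A X (rows are (A x_i)^T)
A %*% cov(samples) %*% t(A)    # A cov(X) A^T, should agree
```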

  • Let \(\boldsymbol{X}_{n \times 1}\) and \(\boldsymbol{Y}_{m \times 1}\) be random vectors where \(n\) and \(m\) could be different. The covariance matrix of \(\boldsymbol{X}\) and \(\boldsymbol{Y}\) is defined as:

\[ cov(\boldsymbol{X}, \boldsymbol{Y}) = (cov(X_i, Y_j)) = \left(\begin{array}{cccc} cov(X_1, Y_1) & cov(X_1, Y_2) & \ldots & cov(X_1, Y_m) \\ cov(X_2, Y_1) & cov(X_2, Y_2) & \ldots & cov(X_2, Y_m) \\ \vdots & \vdots & \ddots & \vdots \\ cov(X_n, Y_1) & cov(X_n, Y_2) & \ldots & cov(X_n, Y_m) \end{array}\right). \]

  • We can also write:

\[\begin{aligned} cov(\boldsymbol{X}, \boldsymbol{Y}) &= E[(\boldsymbol{X} - E[\boldsymbol{X}])(\boldsymbol{Y} - E[\boldsymbol{Y}])^T] \\ &= E\left[ \left(\begin{array}{c} X_1 - E[X_1] \\ \vdots \\ X_n - E[X_n] \end{array}\right) \left( Y_1 - E[Y_1], \ldots, Y_m - E[Y_m] \right) \right]. \end{aligned}\]

  • Some properties of covariance matrices between two vectors:

    • If \(\boldsymbol{A}\) and \(\boldsymbol{B}\) are constant matrices, \(cov(\boldsymbol{A} \boldsymbol{X}, \boldsymbol{B} \boldsymbol{Y}) = \boldsymbol{A} cov(\boldsymbol{X}, \boldsymbol{Y}) \boldsymbol{B}^T.\)

    • If \(\boldsymbol{Z} = \left(\begin{array}{c} \boldsymbol{X} \\ \boldsymbol{Y} \end{array}\right)\), then \(cov(\boldsymbol{Z}) = \left(\begin{array}{cc} cov(\boldsymbol{X}) & cov(\boldsymbol{X}, \boldsymbol{Y}) \\ cov(\boldsymbol{Y}, \boldsymbol{X}) & cov(\boldsymbol{Y}) \end{array}\right)\).
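The block structure of \(cov(\boldsymbol{Z})\) can be seen directly in R, where \(\texttt{cov(X, Y)}\) computes the sample cross-covariance between the columns of two data matrices. A minimal sketch, with made-up dependence between \(\boldsymbol{X}\) and \(\boldsymbol{Y}\):

```r
# cov(X, Y) appears as the off-diagonal block of cov(Z) for the stacked vector Z = (X, Y).
set.seed(4)
X <- matrix(rnorm(2e5), ncol = 2)   # draws of a 2-dimensional X, one per row
Y <- cbind(X[, 1] + rnorm(1e5),     # a 3-dimensional Y correlated with X
           rnorm(1e5),
           X[, 2] - X[, 1])
Z <- cbind(X, Y)                    # each row is one draw of Z = (X, Y)
cov(Z)[1:2, 3:5]   # upper-right block of cov(Z)
cov(X, Y)          # sample cross-covariance, identical
```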

  • Let \(\boldsymbol{X}\) be a random vector. The correlation matrix of \(\boldsymbol{X}\) is defined as:

\[ corr(\boldsymbol{X}) = (corr(X_i, X_j)) = \left(\begin{array}{cccc} 1 & corr(X_1, X_2) & \ldots & corr(X_1, X_n) \\ corr(X_2, X_1) & 1 & \ldots & corr(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ corr(X_n, X_1) & corr(X_n, X_2) & \ldots & 1 \end{array}\right), \] where

\[corr(X_i, X_j) = \frac{cov(X_i, X_j)}{\sqrt{var(X_i) var(X_j)}}.\]
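In R, the stats function \(\texttt{cov2cor()}\) performs exactly this rescaling, dividing each covariance by \(\sqrt{var(X_i)\,var(X_j)}\). A minimal sketch with an arbitrary (valid) covariance matrix:

```r
# Convert a covariance matrix to the corresponding correlation matrix.
S <- matrix(c(4, 2, 1,
              2, 9, 0,
              1, 0, 1), 3, 3)  # an arbitrary positive-definite covariance matrix
cov2cor(S)                     # divides entry (i, j) by sqrt(S[i, i] * S[j, j])
```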

1.3 Generalised Linear Models: A Review

Generalised linear models (GLMs) were developed by Nelder and Wedderburn (1972) as a way to unify a range of statistical models, as well as the statistical methods used to fit and analyse them. In this section, we briefly review the definition of GLMs. Please refer to last term's lecture notes for more detailed derivations and examples.

Definition. A GLM is specified through the following components:

  • A linear predictor: \(\eta = \boldsymbol{\beta}^{T}\boldsymbol{x}\).

  • An injective response function \(h\), such that \(\mu = {\mathrm E}[Y |\boldsymbol{x}, \boldsymbol{\beta}] = h(\eta) = h(\boldsymbol{\beta}^{T}\boldsymbol{x})\).
    Equivalently, one can write \(g(\mu) = \boldsymbol{\beta}^{T}\boldsymbol{x}\), where \(g = h^{-1}\) is the link function.

  • The distributional assumption: \(P_{}\left(Y |\boldsymbol{x}, \boldsymbol{\beta}\right)\) is an EDF, that is: \[\begin{equation} P_{}\left(y |\boldsymbol{x}, \boldsymbol{\beta}\right) = P_{}\left(y |\theta(\boldsymbol{x}, \boldsymbol{\beta}), \phi(\boldsymbol{x}, \boldsymbol{\beta})\right) = \exp \Big( \frac{y\theta - b(\theta)}{\phi} + c(y, \phi) \Big). \end{equation}\] From the properties of the EDF, the mean and variance of this distribution are: \[\begin{align} {\mathrm E}[Y |\theta, \phi] &= \mu = b'(\theta) \\ {\mathrm{Var}}[Y |\theta, \phi] &= \phi \, b''(\theta) = \phi \, b''((b')^{-1}(\mu)) = \phi \, \mathcal{V}(\mu). \end{align}\]

  • We also assume independent data, that is: \[\begin{equation} P_{}\left(\left\{y_{i}\right\} |\left\{\boldsymbol{x}_{i}\right\}, \boldsymbol{\beta}\right) = \prod_{i=1}^n P_{}\left(y_{i} |\boldsymbol{x}_{i}, \boldsymbol{\beta}\right) \end{equation}\] where \(\left\{y_{i}, i = 1,...,n\right\}\) are response data given the \(\left\{\boldsymbol{x}_i, i = 1,...,n\right\}\).
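As a concrete illustration of the distributional assumption (a standard example, revisited in the exercises below), the Poisson distribution with mean \(\mu\) can be written in EDF form: \[ P(y \mid \mu) = \frac{\mu^{y} e^{-\mu}}{y!} = \exp\big( y \log \mu - \mu - \log y! \big), \] so that \(\theta = \log \mu\), \(b(\theta) = e^{\theta}\), \(\phi = 1\) and \(c(y, \phi) = -\log y!\). The EDF formulas then give \({\mathrm E}[Y] = b'(\theta) = e^{\theta} = \mu\) and \({\mathrm{Var}}[Y] = \phi \, b''(\theta) = \mu\), as expected.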

The Natural/Canonical Link. Recall that we have both: \[\begin{alignat}{4} \mu & = {\mathrm E}[Y |\theta, \phi] && = b'(\theta) \tag{1.1} \\ \mu & = {\mathrm E}[Y |\boldsymbol{x}, \boldsymbol{\beta}] && = h(\boldsymbol{\beta}^T\boldsymbol{x}) = h(\eta) \tag{1.2} \end{alignat}\] with Equation (1.1) holding as a result of \(P_{}\left(y |\theta, \phi\right)\) following an EDF distribution, and Equation (1.2) holding by definition for a GLM.

The natural link is the choice \(h = b'\), or equivalently \(g = (b')^{-1}\), resulting in the equation \[\begin{equation} \theta = \boldsymbol{\beta}^T\boldsymbol{x} = \eta. \end{equation}\]
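For example (another standard case, revisited in the exercises below), for the Bernoulli distribution we can write \(P(y \mid \pi) = \pi^{y}(1-\pi)^{1-y} = \exp\big( y \log\frac{\pi}{1-\pi} + \log(1-\pi) \big)\), so \(\theta = \log\frac{\pi}{1-\pi}\) and \(b(\theta) = \log(1 + e^{\theta})\). Then \[ b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad g(\mu) = (b')^{-1}(\mu) = \log\frac{\mu}{1-\mu}, \] so the natural link for the Bernoulli distribution is the logit, the link used in logistic regression.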

1.4 Exercises

Question 1

Prove the identities in Section 1.2.

Question 2

The table below gives some common link functions for GLMs. In the table, \(\Phi(\cdot)\) is the cdf of the standard normal distribution. Find their inverses (i.e., the corresponding response functions \(h\)).

Link function         | \(\eta_i = g(\mu_i)\)
Identity              | \(\mu_i\)
Log                   | \(\log(\mu_i)\)
Inverse               | \(\mu_i^{-1}\)
Inverse-square        | \(\mu_i^{-2}\)
Square-root           | \(\sqrt{\mu_i}\)
Logit                 | \(\log\frac{\mu_i}{1-\mu_i}\)
Probit                | \(\Phi^{-1}(\mu_i)\)
Log-log               | \(-\log(-\log\mu_i)\)
Complementary log-log | \(\log(-\log(1-\mu_i))\)
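A possible way to check some of these in R: the stats function \(\texttt{make.link()}\) bundles a link function with its inverse, although only some of the links in the table are available by name:

```r
# make.link() returns the link g (linkfun) together with its inverse h (linkinv).
# Available names include "logit", "probit", "cloglog", "identity", "log",
# "sqrt", "inverse" and "1/mu^2".
lnk <- make.link("logit")
eta <- lnk$linkfun(0.25)   # g(0.25) = log(0.25 / 0.75)
lnk$linkinv(eta)           # h(g(0.25)) recovers 0.25
```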

Question 3

Write down the GLM components and the corresponding natural link when the response variable \(Y\) is assumed to follow each distribution below. Note that the values of \(y_i\) have to be in the correct range for each distribution.

  • Gaussian: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim \mathcal{N}(\mu_i, \sigma^2)\).
  • Bernoulli: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim Bernoulli(\mu_i)\).
  • Binomial: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim Bin(n_i, \pi_i)\).
  • Rescaled Binomial: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim \frac{1}{n_i} Bin(n_i, \mu_i)\). Note that here \(y_i\) is the proportion of successes in \(n_i\) independent trials with probability of success \(\mu_i\). Thus, the range of \(y_i\) is \(\{ \frac{0}{n_i}, \frac{1}{n_i}, \ldots, \frac{n_i}{n_i} \}\).
  • Poisson: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim Poi(\mu_i)\).
  • Gamma: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim Gamma(\nu, \alpha_i)\), where \(\nu\) and \(\alpha_i\) are the shape and rate parameters of the Gamma distribution.
  • Inverse-Gaussian: \(y_i |\boldsymbol{x}_i, \boldsymbol{\beta} \sim IG(\mu_i, \lambda)\).

From these results, show that the linear regression model, logistic regression model, and Poisson regression model are all special cases of GLMs.

Question 4

In \(\texttt{R}\), find the default link function for each GLM family. Are they the natural link for the corresponding family?
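One way to check is to inspect the family objects directly; each stores its default link name in the \(\texttt{link}\) component:

```r
# Default link function for each of the standard GLM families in R.
gaussian()$link          # "identity"
binomial()$link          # "logit"
poisson()$link           # "log"
Gamma()$link             # "inverse"
inverse.gaussian()$link  # "1/mu^2"
```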

Question 5

(From Dobson & Barnett, 2018)

The following associations can be described by generalized linear models. For each one, identify the response variable and the explanatory variables, select a probability distribution for the response (justifying your choice) and write down the linear component.

  1. The effect of age, sex, height, mean daily food intake and mean daily energy expenditure on a person’s weight.
  2. The proportions of laboratory mice that became infected after exposure to bacteria when five different exposure levels are used and 20 mice are exposed at each level.
  3. The association between the number of trips per week to the supermarket for a household and the number of people in the household, the household income and the distance to the supermarket.

Question 6

Suppose we have an (imagined) dataset, given in the table below, recording for each household the number of trips to the supermarket, the household size, its council tax band, and whether it has a car.

  • Convert this into grouped data and find the values of \(n\), \(m_i\), \(y_i\).
  • Assuming we model this grouped data using a Poisson GLM, find the dispersion \(\phi_i\) for each group.
Number of trips | Size | Council tax band | Car
3               | 1    | C                | 1
1               | 2    | B                | 0
1               | 1    | A                | 0
3               | 4    | B                | 0
2               | 1    | A                | 0
4               | 3    | C                | 1
2               | 1    | C                | 1
2               | 2    | B                | 0
1               | 1    | A                | 0
2               | 3    | C                | 0
3               | 4    | C                | 1
2               | 4    | B                | 0
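A minimal R sketch of how this toy dataset could be entered and grouped by covariate pattern (the variable names are illustrative choices only):

```r
# Enter the ungrouped data, one household per row.
dat <- data.frame(
  trips = c(3, 1, 1, 3, 2, 4, 2, 2, 1, 2, 3, 2),
  size  = c(1, 2, 1, 4, 1, 3, 1, 2, 1, 3, 4, 4),
  band  = c("C", "B", "A", "B", "A", "C", "C", "B", "A", "C", "C", "B"),
  car   = c(1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0)
)

# Group households sharing the same covariate pattern (size, band, car):
# m = group size, plus the total and mean of the response within each group.
grouped <- aggregate(trips ~ size + band + car, data = dat,
                     FUN = function(t) c(m = length(t), total = sum(t), mean = mean(t)))
grouped
```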

References

Nelder, John A., and Robert W. M. Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society, Series A (General) 135 (3): 370–84.