Chapter 2 Quantifying Uncertainty Via the Bootstrap
2.1 Our Goalposts
Recall that the goal of Frequentist inference is to obtain estimators, intervals, and hypothesis tests that have strong properties with respect to the sampling distribution (as opposed to the posterior distribution). Given data \(\mathcal D\), a Frequentist approach might be to construct an interval estimate for a parameter \(\psi\) such that \[ G_\theta\{L(\mathcal D) \le \psi \le U(\mathcal D)\} = 1 - \alpha, \] for a desired confidence level \(1 - \alpha\). Such intervals are often of the form \(\widehat\psi \pm z_{\alpha/2} \, s_{\widehat\psi}\), where \(\widehat \psi\) is a point estimate, \(s_{\widehat \psi}\) is an estimate of the standard deviation of \(\widehat \psi\), and \(z_{\alpha/2}\) is an appropriate quantile of the standard normal distribution. While rarely possible, we would like this coverage to hold exactly and without depending on \(\theta\).
2.2 The Bootstrap Principle
It is not always possible, given a sample \(\mathcal D\sim G\), to determine the sampling distribution of a statistic \(\widehat \psi = \widehat \psi(\mathcal D)\). This is because we do not know the distribution \(G\); of course, if we knew \(G\), we would not need to do any inference. The bootstrap gets around this problem by using the data to estimate \(G\) with some \(\widehat G\). Given \(\widehat G\), we can compute the sampling distribution of \(\widehat\psi^\star = \widehat\psi(\mathcal D^\star)\) where \(\mathcal D^\star \sim \widehat G\).
The Bootstrap Principle. Suppose that \(\mathcal D\sim G\), \(\psi = \psi(G)\) is some parameter of interest of the distribution \(G\), and \(\widehat\psi(\mathcal D)\) is some statistic aimed at estimating \(\psi\). Then we can evaluate the sampling distribution of \(\widehat\psi(\mathcal D)\) by
1. estimating \(G\) with some \(\widehat G\); and
2. using the sampling distribution of \(\widehat\psi(\mathcal D^\star)\) as an estimate of the sampling distribution of \(\widehat\psi(\mathcal D)\).
Implementing the bootstrap principle has two minor complications. First, how do we estimate \(G\)? Second, how do we compute the sampling distribution of \(\widehat\psi(\mathcal D^\star)\)?
How we estimate \(G\) typically depends on the structure of the problem. Suppose, for example, that \(\mathcal D= (X_1, \ldots, X_N)\) which are sampled iid from \(F\) (so that \(G = F^N\)). Then a standard choice is to use the empirical distribution function \(\widehat F= \mathbb F_N = N^{-1} \sum_{i=1}^N \delta_{X_i}\) where \(\delta_x\) is the point mass at \(x\) (so that \(\widehat G= {\widehat F}^N\)); this is referred to as the nonparametric bootstrap because it does not depend on any parametric assumptions about \(F\).
In all but the simplest settings, Monte Carlo is used to approximate the sampling distribution of \(\widehat\psi^\star\). That is, we sample \(\mathcal D^\star_1, \ldots, \mathcal D^\star_B\) independently from \(\widehat G\) and take \(\frac{1}{B} \sum_{b=1}^B \delta_{\widehat\psi^\star_b}\) as our approximation of the sampling distribution of \(\widehat\psi\), where \(\widehat\psi^\star_b = \widehat\psi(\mathcal D^\star_b)\).
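To make this concrete, here is a minimal sketch of the Monte Carlo nonparametric bootstrap in Python (the helper name `bootstrap_distribution` and the choice of the sample median as the statistic are purely illustrative, not part of the text):

```python
import numpy as np

def bootstrap_distribution(x, statistic, B=2000, seed=None):
    """Monte Carlo approximation to the sampling distribution of statistic(D*),
    where D* ~ F_N^N.  Sampling from the empirical distribution F_N amounts to
    resampling the observed data with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    reps = np.empty(B)
    for b in range(B):
        x_star = rng.choice(x, size=n, replace=True)   # D*_b ~ F_N^N
        reps[b] = statistic(x_star)                    # psi_hat*_b = psi_hat(D*_b)
    return reps

# Example: bootstrap the sample median of 50 exponential observations.
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50)
reps = bootstrap_distribution(x, np.median, B=2000, seed=1)
print(reps.mean(), reps.std(ddof=1))
```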
Exercise 2.1 Suppose that \(X_1, \ldots, X_N \stackrel{\text{iid}}{\sim}F\) and let \(\psi(F)\) denote the population mean of \(F\), i.e., \(\psi(F) = \mathbb E_F(X_i) = \int x \, F(dx)\). We consider bootstrapping the sample mean \(\bar X_N = N^{-1} \sum_{i=1}^N X_i\) using the approximation \(\widehat F = \mathbb F_N\). That is, we consider the sampling distribution of \(\bar X^\star = N^{-1} \sum_{i=1}^N X_i^\star\) where \(X_1^\star, \ldots, X_N^\star\) are sampled independently from \(\mathbb F_N\).
What is \(\psi(\mathbb F_N)\)?
The actual bias of \(\bar X_N\) is \(\mathbb E_F\{\bar X_N - \psi(F)\} = 0\). What is the bootstrap estimate of the bias, \(\mathbb E_{\mathbb F_N}(\bar X^\star - \bar X_N)\)?
The variance of \(\bar X_N\) is \(\sigma^2_F / N\), where \(\sigma^2_F = \operatorname{Var}_F(X_i)\). What is the bootstrap estimate of the variance of \(\bar X_N\), \(\operatorname{Var}_{\mathbb F_N}(\bar X^\star)\)?
A parameter \(\psi\) is said to be linear if it can be written as \(\psi(F) = \int t(x) \, F(dx)\) for some choice of \(t(x)\). In this case it is natural to estimate \(\psi\) using \(\bar T = N^{-1} \sum_i t(X_i)\). Write down the bootstrap estimate of the bias and variance of \(\bar T\) in this setting.
Given the sampling distribution of \(\widehat\psi\), we can do things like construct confidence intervals for \(\psi\). For example, it is often the case that \(\widehat\psi\) is asymptotically normal and centered at \(\psi\). We can then use the bootstrap estimate of \(\operatorname{Var}(\widehat\psi)\) to make the confidence interval \[ \begin{aligned} \widehat\psi\pm z_{\alpha/2} \sqrt{\operatorname{Var}_{\widehat G}(\widehat\psi^\star)}. \end{aligned} \] In this way, the bootstrap gives us a way to estimate \(\operatorname{Var}(\widehat\psi)\) more-or-less automatically.
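As a sketch of this normal-interval construction (the function name `bootstrap_normal_ci` is an illustrative choice; the standard error is taken to be the standard deviation of the bootstrap replications):

```python
import numpy as np
from scipy import stats

def bootstrap_normal_ci(x, statistic, alpha=0.05, B=2000, seed=None):
    """psi_hat +/- z_{alpha/2} * sqrt(Var_{G_hat}(psi_hat*)), with the variance
    estimated by the nonparametric bootstrap."""
    rng = np.random.default_rng(seed)
    psi_hat = statistic(x)
    reps = np.array([statistic(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    se = reps.std(ddof=1)                      # bootstrap standard error
    z = stats.norm.ppf(1 - alpha / 2)          # z_{alpha/2}
    return psi_hat - z * se, psi_hat + z * se
```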
For the next problem, we recall the delta method approach to computing standard errors. Suppose that \(\widehat\mu\) has mean \(\mu\) and variance \(\tau^2\) and that we want to approximate the mean and variance of \(g(\widehat\mu)\). The delta method states that, if \(\tau\) is sufficiently small, then \(\mathbb E\{g(\widehat\mu)\} \approx g(\mu)\) and \(\operatorname{Var}\{g(\widehat\mu)\} \approx g'(\mu)^2 \tau^2\). This is based on the somewhat crude approximation \[ g(\widehat\mu) \approx g(\mu) + (\widehat\mu- \mu) g'(\mu) + \text{remainder} \] with the remainder being of order \(O(\tau^2)\). The delta method approximation is obtained by ignoring the remainder.
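For concreteness, here is a small sketch of a generic delta-method standard error; the helper `delta_method_se` and the example transformation \(g(\mu) = \mu^2\) are illustrative choices, not part of the exercise below:

```python
import numpy as np

def delta_method_se(mu_hat, tau, g_prime):
    """Delta-method SE of g(mu_hat) when mu_hat has standard error tau:
    approximately |g'(mu_hat)| * tau."""
    return abs(g_prime(mu_hat)) * tau

# Example: g(mu) = mu^2, so g'(mu) = 2 * mu.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)
mu_hat = x.mean()
tau = x.std(ddof=1) / np.sqrt(len(x))          # estimated SE of the sample mean
print(delta_method_se(mu_hat, tau, lambda m: 2 * m))
```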
Exercise 2.2 Let \(X_1, \ldots, X_n \stackrel{\text{iid}}{\sim}\operatorname{Normal}(\mu,1)\) and let \(\psi = e^\mu\) and \(\widehat\psi= e^{\bar X_n}\) be the MLE of \(\psi\). Create a dataset using \(\mu = 5\) consisting of \(n = 20\) observations.
Use the delta method to get the standard error and 95% confidence interval for \(\psi\).
Use the nonparametric bootstrap to get the standard error and 95% confidence interval for \(\psi\).
The parametric bootstrap makes use of the assumption that \(F\) (in this case) is a normal distribution. Specifically, we take \(\widehat F\) equal to its maximum likelihood estimate, the \(\operatorname{Normal}(\bar X_n, 1)\) distribution. Using the parametric bootstrap, compute the standard error and a 95% confidence interval for \(\psi\).
Plot a histogram of the bootstrap replications for the parametric and nonparametric bootstraps, along with the approximation of the sampling distribution of \(\widehat\psi\) obtained from the delta method (i.e., \(\operatorname{Normal}(\widehat\psi, \widehat s^2)\)). Compare these to the true sampling distribution of \(\widehat\psi\). Which approximation is closest to the true distribution?
Depending on the random data generated for this exercise, you will most likely find that the approximations of the sampling distribution of \(\widehat\psi\) produced by both the bootstrap and the delta method are not so good; the biggest problem is that the sampling distribution will be location-shifted by \(\widehat\psi - \psi\). Repeat part (d), but instead compare the sampling distribution of \(\widehat\psi - \psi\) to the bootstrap estimates obtained by sampling \(\widehat\psi^\star - \widehat\psi\). (A starting-point sketch for the simulation in this exercise is given below.)
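One possible starting point for the simulation in Exercise 2.2, written as a sketch rather than a complete solution (the seed, the value of B, and the variable names are arbitrary; the comparisons in parts (d) and (e) are left to the reader):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, n, B = 5.0, 20, 5000
x = rng.normal(loc=mu, scale=1.0, size=n)        # the observed data
psi_hat = np.exp(x.mean())                       # MLE of psi = exp(mu)

# (a) Delta method: g(mu) = exp(mu), g'(mu) = exp(mu), Var(xbar) = 1/n.
se_delta = np.exp(x.mean()) / np.sqrt(n)

# (b) Nonparametric bootstrap: resample the observed data with replacement.
nonpar = np.array([np.exp(rng.choice(x, size=n, replace=True).mean())
                   for _ in range(B)])

# (c) Parametric bootstrap: draw fresh samples from Normal(xbar, 1).
par = np.array([np.exp(rng.normal(loc=x.mean(), scale=1.0, size=n).mean())
                for _ in range(B)])

z = stats.norm.ppf(0.975)
for label, se in [("delta", se_delta),
                  ("nonparametric", nonpar.std(ddof=1)),
                  ("parametric", par.std(ddof=1))]:
    print(label, psi_hat - z * se, psi_hat + z * se)
```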
The lesson of part (e) is that the bootstrap approximation is likely to be best when we apply it to pivotal quantities. A quantity \(S(X, \psi)\) (which is allowed to depend on \(\psi\)) is said to be pivotal if its distribution does not depend on \(\psi\). For example, in Exercise 2.2 the statistic \(\sqrt n(\bar X - \mu)\) is a pivotal quantity, and in general \(Z = \frac{\sqrt n (\bar X - \mu)}{s}\) is asymptotically pivotal (where \(s\) is the sample standard deviation).
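A quick simulation illustrating what pivotality means in Exercise 2.2 (a sketch; it checks that the distribution of \(\sqrt n(\bar X - \mu)\) looks the same no matter what \(\mu\) is):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 10000
for mu in (0.0, 5.0, 50.0):
    xbar = rng.normal(loc=mu, scale=1.0, size=(reps, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - mu)       # pivotal: Normal(0, 1) for every mu
    print(mu, round(z.mean(), 3), round(z.std(), 3))
```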
Exercise 2.3 While we saw an improved approximation for \(\widehat\psi- \psi\), argue that this is nevertheless not a pivotal quantity. Propose a pivotal quantity \(S(\widehat\psi, \psi)\) which is more suitable for bootstrapping.
The intervals computed in the previous exercise rely on asymptotic normality, which we might prefer to avoid. An alternative approach is to apply the bootstrap to \(\zeta = \widehat\psi - \psi(F)\) rather than to \(\widehat\psi\) directly, so that \(\psi(F) = \widehat\psi - \zeta\). If we knew the \(\alpha/2\) and \((1 - \alpha/2)\) quantiles of \(\zeta\) (say, \(\zeta_{\alpha/2}\) and \(\zeta_{1-\alpha/2}\)), then we could form the confidence interval \((\widehat\psi - \zeta_{1-\alpha/2}, \widehat\psi - \zeta_{\alpha/2})\), which satisfies \[ \begin{aligned} G_\theta(\widehat\psi- \zeta_{1-\alpha/2} \le \psi \le \widehat\psi- \zeta_{\alpha/2}) = 1 - \alpha. \end{aligned} \]
The empirical bootstrap estimates these quantiles from the quantiles of \(\widehat\psi^\star - \psi(\widehat F)\), which are computed by simulation. More generally, we could use this approach for any pivotal quantity; for example, since \(\xi = \widehat\psi/ \psi\) is pivotal in Exercise 2.2, we could use \((\widehat\psi/ \xi_{1-\alpha/2}, \widehat\psi/ \xi_{\alpha/2})\) as our interval.
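A sketch of the empirical bootstrap interval based on \(\zeta\) (the name `basic_bootstrap_ci` is illustrative; the analogous interval based on \(\xi\) is left for Exercise 2.4):

```python
import numpy as np

def basic_bootstrap_ci(x, statistic, alpha=0.05, B=2000, seed=None):
    """Interval from the bootstrap quantiles of zeta* = psi_hat* - psi(F_hat)."""
    rng = np.random.default_rng(seed)
    psi_hat = statistic(x)
    reps = np.array([statistic(rng.choice(x, size=len(x), replace=True))
                     for _ in range(B)])
    zeta = reps - psi_hat                             # zeta* = psi_hat* - psi(F_hat)
    lo, hi = np.quantile(zeta, [alpha / 2, 1 - alpha / 2])
    return psi_hat - hi, psi_hat - lo                 # (psi_hat - zeta_{1-a/2}, psi_hat - zeta_{a/2})
```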
Exercise 2.4 Use the nonparametric bootstrap to make a 95% confidence interval using the pivotal quantity \(\xi\) described above.