1.2 Bayesian framework: A brief summary of theory

Given an unknown parameter set \(\boldsymbol{\theta}\), and a particular realization of the data \(\mathbf{y}\), Bayes’ rule may be applied analogously,5 \[\begin{align} \pi(\boldsymbol{\theta}\mid \mathbf{y})&=\frac{p(\mathbf{y}\mid \boldsymbol{\theta}) \times \pi(\boldsymbol{\theta})}{p(\mathbf{y})}, \tag{1.3} \end{align}\]

where \(\pi(\boldsymbol{\theta}\mid \mathbf{y})\) is the posterior density function, \(\pi(\boldsymbol{\theta})\) is the prior density, \(p(\mathbf{y}\mid \boldsymbol{\theta})\) is the likelihood (statistical model), and

\[\begin{equation} p(\mathbf{y})=\int_{\mathbf{\Theta}}p(\mathbf{y}\mid \boldsymbol{\theta})\pi(\boldsymbol{\theta})d\boldsymbol{\theta}=\mathbb{E}\left[p(\mathbf{y}\mid \boldsymbol{\theta})\right] \tag{1.4} \end{equation}\]

is the marginal likelihood or prior predictive density. Observe that for this expected value to be meaningful, the prior should be a proper density, that is, it must integrate to one; otherwise, the expectation in equation (1.4) is not well defined.

Observe that \(p(\mathbf{y} \mid \boldsymbol{\theta})\) is not a density in \(\boldsymbol{\theta}\). In addition, \(\pi(\boldsymbol{\theta})\) does not have to integrate to 1, that is, \(\pi(\boldsymbol{\theta})\) can be an improper density function, \(\int_{\mathbf{\Theta}} \pi(\boldsymbol{\theta}) d\boldsymbol{\theta} = \infty\). However, \(\pi(\boldsymbol{\theta} \mid \mathbf{y})\) is a proper density function, that is, \(\int_{\mathbf{\Theta}} \pi(\boldsymbol{\theta} \mid \mathbf{y}) d\boldsymbol{\theta} = 1\).

For instance, set \(\pi(\boldsymbol{\theta}) = c\), where \(c\) is a constant, then \(\int_{\mathbf{\Theta}} c d\boldsymbol{\theta} = \infty\). However, \[ \int_{\mathbf{\Theta}} \pi(\boldsymbol{\theta} \mid \mathbf{y}) d\boldsymbol{\theta} = \int_{\mathbf{\Theta}} \frac{p(\mathbf{y} \mid \boldsymbol{\theta}) \times c}{\int_{\mathbf{\Theta}} p(\mathbf{y} \mid \boldsymbol{\theta}) \times c \, d\boldsymbol{\theta}} d\boldsymbol{\theta} = 1 \] where \(c\) cancels out.

\(\pi(\boldsymbol{\theta} \mid \mathbf{y})\) is a sample-updated “probabilistic belief” version of \(\pi(\boldsymbol{\theta})\), where \(\pi(\boldsymbol{\theta})\) is a prior probabilistic belief that can be constructed from previous empirical work, theoretical foundations, expert knowledge, and/or mathematical convenience. This prior usually depends on parameters of its own, which are named hyperparameters. In addition, the Bayesian approach implies using a probabilistic model for \(\mathbf{Y}\) given \(\boldsymbol{\theta}\), that is, \(p(\mathbf{y} \mid \boldsymbol{\theta})\), whose prior-weighted integral over \(\mathbf{\Theta}\), \(p(\mathbf{y})\), is named the evidence due to being a measure of model fit to the data.

Observe that the Bayesian inferential approach is conditional, that is, what can we learn about an unknown object \(\boldsymbol{\theta}\) given that we already observed \(\mathbf{Y} =\mathbf{y}\)? The answer is also conditional on the probabilistic model, that is, \(p(\mathbf{y} \mid \boldsymbol{\theta})\). So, what if we want to compare different models, say \(\mathcal{M}_m\), \(m = 1,2,\dots,M\)? Then, we should make this explicit in the Bayes’ rule formulation: \[\begin{align} \pi(\boldsymbol{\theta}\mid \mathbf{y},\mathcal{M}_m)&=\frac{p(\mathbf{y}\mid \boldsymbol{\theta},\mathcal{M}_m) \times \pi(\boldsymbol{\theta}\mid \mathcal{M}_m)}{p(\mathbf{y}\mid \mathcal{M}_m)}. \tag{1.5} \end{align}\]

The posterior model probability is \[\begin{align} \pi(\mathcal{M}_m\mid \mathbf{y})&=\frac{p(\mathbf{y}\mid \mathcal{M}_m) \times \pi(\mathcal{M}_m)}{p(\mathbf{y})}, \tag{1.6} \end{align}\]

where \(p(\mathbf{y}\mid \mathcal{M}_m)=\int_{\mathbf{\Theta}}p(\mathbf{y}\mid \boldsymbol{\theta},\mathcal{M}_m) \times \pi(\boldsymbol{\theta}\mid \mathcal{M}_m)d\boldsymbol{\theta}\) due to equation (1.5), and \(\pi(\mathcal{M}_m)\) is the prior model probability.

Calculating \(p(\mathbf{y})\) in equations (1.3) and (1.6) is very demanding in most realistic cases. Fortunately, it is not required when performing inference about \(\boldsymbol{\theta}\), since \(\boldsymbol{\theta}\) is integrated out of \(p(\mathbf{y})\), which therefore acts only as a normalizing constant. Then, all you need to know about the shape of the posterior of \(\boldsymbol{\theta}\) is in \(p(\mathbf{y} \mid \boldsymbol{\theta}, \mathcal{M}_m) \times \pi(\boldsymbol{\theta} \mid \mathcal{M}_m)\), or without explicitly conditioning on \(\mathcal{M}_m\), \[\begin{align} \pi(\boldsymbol{\theta}\mid \mathbf{y})& \propto p(\mathbf{y}\mid \boldsymbol{\theta}) \times \pi(\boldsymbol{\theta}). \tag{1.7} \end{align}\]

Equation (1.7) is a very good shortcut to perform Bayesian inference about \(\boldsymbol{\theta}\).
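As a quick illustration of this shortcut, the unnormalized posterior in equation (1.7) can be evaluated on a grid and normalized numerically. The following is a minimal sketch (not part of the original example), using hypothetical Poisson data and an illustrative gamma prior.

yEx <- c(2, 1, 4, 0, 3)                     # hypothetical data
thetaGrid <- seq(0.01, 10, by = 0.01)       # grid of parameter values
prior <- dgamma(thetaGrid, shape = 1, scale = 1)              # illustrative gamma prior
lik <- sapply(thetaGrid, function(l) prod(dpois(yEx, l)))     # likelihood at each grid point
postKernel <- lik * prior                   # right-hand side of equation (1.7)
posterior <- postKernel / sum(postKernel * 0.01)              # normalize so it integrates to one
# "posterior" now approximates the posterior density of theta on the grid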

We can also avoid calculating \(p(\mathbf{y})\) when performing model selection (hypothesis testing) using the posterior odds ratio, that is, comparing models \(\mathcal{M}_1\) and \(\mathcal{M}_2\),

\[\begin{align} PO_{12}&=\frac{\pi(\mathcal{M}_1\mid \mathbf{y})}{\pi(\mathcal{M}_2\mid \mathbf{y})} \nonumber \\ &=\frac{p(\mathbf{y}\mid \mathcal{M}_1)}{p(\mathbf{y}\mid \mathcal{M}_2)}\times\frac{\pi(\mathcal{M}_1)}{\pi(\mathcal{M}_2)}, \tag{1.8} \end{align}\]

where the first term in equation (1.8) is named the Bayes factor, and the second term is the prior odds. Observe that the Bayes factor is a ratio of the marginal likelihood ordinates of \(\mathbf{y}\) under the two models. Thus, the Bayes factor is a measure of the relative sample evidence in favor of model 1 compared to model 2.

However, we still need to calculate \(p(\mathbf{y}\mid \mathcal{M}_m) = \int_{\mathbf{\Theta}} p(\mathbf{y}\mid \boldsymbol{\theta}, \mathcal{M}_m) \pi(\boldsymbol{\theta}\mid \mathcal{M}_m) d\boldsymbol{\theta} = \mathbb{E}\left[ p(\mathbf{y}\mid \boldsymbol{\theta}, \mathcal{M}_m) \right]\). For this integral to be meaningful, the prior must be proper. Using an improper prior has unintended consequences when comparing models; for instance, parsimonious models are favored by posterior odds or Bayes factors, and these values may depend on units of measure (see Chapter 3).

A nice feature of comparing models using posterior odds is that if we have an exhaustive set of competing models such that \(\sum_{m=1}^M \pi(\mathcal{M}_m \mid \mathbf{y}) = 1\), then we can recover \(\pi(\mathcal{M}_m \mid \mathbf{y})\) without calculating \(p(\mathbf{y})\). In particular, given two models \(\mathcal{M}_1\) and \(\mathcal{M}_2\) such that \(\pi(\mathcal{M}_1 \mid \mathbf{y}) + \pi(\mathcal{M}_2 \mid \mathbf{y}) = 1\), we have: \[ \pi(\mathcal{M}_1 \mid \mathbf{y}) = \frac{PO_{12}}{1 + PO_{12}} \quad \text{and} \quad \pi(\mathcal{M}_2 \mid \mathbf{y}) = 1 - \pi(\mathcal{M}_1 \mid \mathbf{y}). \] In general, \[ \pi(\mathcal{M}_m \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathcal{M}_m) \times \pi(\mathcal{M}_m)}{\sum_{l=1}^M p(\mathbf{y} \mid \mathcal{M}_l) \times \pi(\mathcal{M}_l)}. \] These posterior model probabilities can be used to perform Bayesian model averaging.
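As a minimal sketch (with illustrative numbers, not those of the example below), posterior model probabilities can be computed from log marginal likelihoods and prior model probabilities as follows; working on the log scale avoids numerical underflow.

logML <- c(-10.2, -12.5, -11.1)            # hypothetical log marginal likelihoods for M = 3 models
priorM <- rep(1 / 3, 3)                    # equal prior model probabilities
logPost <- logML + log(priorM)             # log numerators of the posterior model probabilities
logPost <- logPost - max(logPost)          # stabilize before exponentiating
postM <- exp(logPost) / sum(exp(logPost))  # posterior model probabilities
postM                                      # weights that can be used for Bayesian model averaging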

Table 1.1 shows guidelines for the interpretation of \(2\log(PO_{12})\) (R. E. Kass and Raftery 1995). This transformation is done to replicate the structure of the likelihood ratio test statistic. However, posterior odds do not require nested models as the likelihood ratio test does.

Table 1.1: Kass and Raftery guidelines

| \(2\log(PO_{12})\) | \(PO_{12}\) | Evidence against \(\mathcal{M}_{2}\) |
|---|---|---|
| 0 to 2 | 1 to 3 | Not worth more than a bare mention |
| 2 to 6 | 3 to 20 | Positive |
| 6 to 10 | 20 to 150 | Strong |
| > 10 | > 150 | Very strong |

Observe that the posterior odds ratio is a relative criterion, that is, we specify an exhaustive set of competing models and compare them. However, we may want to check the performance of a model on its own or use a non-informative prior. In this case, we can use the posterior predictive p-value (A. Gelman and Meng 1996; A. Gelman, Meng, and Stern 1996).6

The intuition behind the posterior predictive p-value is simple: assess the discrepancy between the model’s assumptions and the data by checking whether a tail-area probability for a chosen discrepancy measure is extreme. Observe that this approach does not check whether a model is true; its focus is on potential discrepancies between the model and the data at hand.

This is done by simulating pseudo-data from our sampling model (\(\mathbf{y}^{(s)}, s=1,2,\dots,S\)) using draws from the posterior distribution, and then calculating a discrepancy measure, \(D(\mathbf{y}^{(s)},\boldsymbol{\theta})\), to estimate the posterior predictive p-value, \[ p_D(\mathbf{y}) = P[D(\mathbf{y}^{(s)},\boldsymbol{\theta}) \geq D(\mathbf{y},\boldsymbol{\theta})], \] using the proportion of the \(S\) draws for which \(D(\mathbf{y}^{(s)},\boldsymbol{\theta}^{(s)}) \geq D(\mathbf{y},\boldsymbol{\theta}^{(s)})\). Extreme tail probabilities (\(p_D(\mathbf{y}) \leq 0.05\) or \(p_D(\mathbf{y}) \geq 0.95\)) suggest potential discrepancies between the data and the model. A. Gelman, Meng, and Stern (1996) also suggest the posterior predictive p-value based on the minimum discrepancy, \[ D_{\min}(\mathbf{y}) = \min_{\boldsymbol{\theta}} D(\mathbf{y}, \boldsymbol{\theta}), \] and the average discrepancy statistic \[ D(\mathbf{y}) = \mathbb{E}[D(\mathbf{y}, \boldsymbol{\theta})] = \int_{\mathbf{\Theta}} D(\mathbf{y}, \boldsymbol{\theta}) \pi(\boldsymbol{\theta} \mid \mathbf{y}) d\boldsymbol{\theta}. \] These alternatives can be more computationally demanding.
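A minimal sketch of this simulation, assuming a hypothetical Poisson model with a conjugate gamma prior and a chi-squared discrepancy measure (these choices are illustrative, not prescribed by the text), is:

set.seed(10101)
yObs <- rpois(50, lambda = 2)              # hypothetical observed data
Nobs <- length(yObs); S <- 5000
lambdaDraws <- rgamma(S, shape = 1 + sum(yObs), scale = 1 / (Nobs + 1))   # posterior draws under a G(1, 1) prior
Disc <- function(y, lambda) sum((y - lambda)^2 / lambda)                  # chi-squared discrepancy measure
Extreme <- sapply(1:S, function(s) {
  yRep <- rpois(Nobs, lambdaDraws[s])      # pseudo-data simulated from the sampling model
  Disc(yRep, lambdaDraws[s]) >= Disc(yObs, lambdaDraws[s])
})
mean(Extreme)                              # estimated posterior predictive p-value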

The Bayesian approach is also suitable to get probabilistic predictions, that is, we can obtain a posterior predictive density

\[\begin{align} \pi(\mathbf{y}_0\mid \mathbf{y},\mathcal{M}_m) & =\int_{\mathbf{\Theta}}\pi(\mathbf{y}_0,\boldsymbol{\theta}\mid \mathbf{y},\mathcal{M}_m)d\boldsymbol{\theta}\nonumber\\ &=\int_{\mathbf{\Theta}}\pi(\mathbf{y}_0\mid \boldsymbol{\theta},\mathbf{y},\mathcal{M}_m)\pi(\boldsymbol{\theta}\mid \mathbf{y},\mathcal{M}_m)d\boldsymbol{\theta}. \tag{1.9} \end{align}\]

Observe that equation (1.9) is again an expectation \(\mathbb{E}[\pi(\mathbf{y}_0 \mid \boldsymbol{\theta}, \mathbf{y}, \mathcal{M}_m)]\), this time using the posterior distribution. Therefore, the Bayesian approach takes estimation error into account when performing prediction.

As we have seen repeatedly, expectation (integration) is a common feature of Bayesian inference. This is why computation based on Monte Carlo integration is so remarkably relevant in the Bayesian framework.
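For instance, any posterior expectation \(\mathbb{E}[g(\boldsymbol{\theta}) \mid \mathbf{y}]\) can be approximated by averaging \(g\) over posterior draws. A minimal sketch, assuming hypothetical gamma posterior draws:

set.seed(10101)
S <- 10000
thetaDraws <- rgamma(S, shape = 6, scale = 0.2)   # hypothetical posterior draws
mean(thetaDraws)                                  # Monte Carlo estimate of E[theta | y]
mean(thetaDraws > 1)                              # Monte Carlo estimate of P(theta > 1 | y)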

Bayesian model averaging (BMA) allows for considering model uncertainty in prediction or any unknown probabilistic object. In the case of the predictive density,

\[\begin{align} \pi(\mathbf{y}_0\mid \mathbf{y})&=\sum_{m=1}^M \pi(\mathcal{M}_m\mid \mathbf{y})\pi(\mathbf{y}_0\mid \mathbf{y},\mathcal{M}_m). \end{align}\] In the case of the posterior density of the parameters, \[\begin{align} \pi(\boldsymbol{\theta}\mid \mathbf{y})&=\sum_{m=1}^M \pi(\mathcal{M}_m\mid \mathbf{y})\pi(\boldsymbol{\theta}\mid \mathbf{y},\mathcal{M}_m), \end{align}\] where \[\begin{align} \mathbb{E}(\boldsymbol{\theta}\mid \mathbf{y})=\sum_{m=1}^{M}\hat{\boldsymbol{\theta}}_m \pi(\mathcal{M}_m\mid \mathbf{y}), \tag{1.10} \end{align}\] and \[\begin{align} Var({\theta}_k\mid \mathbf{y})= \sum_{m=1}^{M}\pi(\mathcal{M}_m\mid \mathbf{y}) \widehat{Var} ({\theta}_{km}\mid \mathbf{y},\mathcal{M}_m)+\sum_{m=1}^{M} \pi(\mathcal{M}_m\mid \mathbf{y}) (\hat{{\theta}}_{km}-\mathbb{E}[{\theta}_{km}\mid \mathbf{y}])^2, \tag{1.11} \end{align}\] \(\hat{\boldsymbol{\theta}}_m\) is the posterior mean and \(\widehat{Var}({\theta}_{km}\mid \mathbf{y},\mathcal{M}_m)\) is the posterior variance of the \(k\)-th element of \(\boldsymbol{\theta}\) under model \(\mathcal{M}_m\).

Observe how the variance in equation (1.11) captures the extra variability due to potential differences between the posterior mean estimates under each model and the posterior mean that incorporates model uncertainty in equation (1.10).

A significant advantage of the Bayesian approach, which is particularly useful when data arrive sequentially (see Chapter 8), is the way the posterior distribution updates with new sample information. Given \(\mathbf{y} = \mathbf{y}_{1:t+1}\), a sequence of observations from 1 to \(t+1\), then \[\begin{align} \pi(\boldsymbol{\theta}\mid \mathbf{y}_{1:t+1})&\propto p(\mathbf{y}_{1:t+1}\mid \boldsymbol{\theta})\times \pi(\boldsymbol{\theta})\nonumber\\ &= p(y_{t+1}\mid \mathbf{y}_{1:t},\boldsymbol{\theta})\times p(\mathbf{y}_{1:t}\mid \boldsymbol{\theta})\times \pi(\boldsymbol{\theta})\nonumber\\ &\propto p(y_{t+1}\mid \mathbf{y}_{1:t},\boldsymbol{\theta})\times \pi(\boldsymbol{\theta}\mid \mathbf{y}_{1:t}). \tag{1.12} \end{align}\]

We observe in Equation (1.12) that the new prior is simply the posterior distribution based on the previous observations. This is particularly useful under the assumption of conditional independence, that is, \(Y_{t+1} \perp \mathbf{Y}_{1:t} \mid \boldsymbol{\theta}\), so that \(p(y_{t+1} \mid \mathbf{y}_{1:t}, \boldsymbol{\theta}) = p(y_{t+1} \mid \boldsymbol{\theta})\), allowing the posterior to be recovered recursively (Petris, Petrone, and Campagnoli 2009). This facilitates online updating because all information up to time \(t\) is captured in \(\boldsymbol{\theta}\). Therefore, \(\pi(\boldsymbol{\theta} \mid \mathbf{y}_{1:t+1}) \propto p(y_{t+1} \mid \boldsymbol{\theta}) \times \pi(\boldsymbol{\theta} \mid \mathbf{y}_{1:t}) \propto \prod_{h=1}^{t+1} p(y_h \mid \boldsymbol{\theta}) \times \pi(\boldsymbol{\theta})\). This recursive expression can be computed more efficiently at any specific point in time \(t\), compared to a batch-mode algorithm, which requires processing all information up to time \(t\) simultaneously.
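In conjugate settings this recursion is immediate. A minimal sketch for a Poisson-Gamma model (hypothetical data and hyperparameters) that verifies the sequential, one-observation-at-a-time posterior coincides with the batch posterior:

yEx <- c(0, 3, 2, 1, 0)          # hypothetical data
a0 <- 1; b0 <- 1                 # hypothetical prior shape and scale
a <- a0; b <- b0
for (t in seq_along(yEx)) {      # process one observation at a time
  a <- a + yEx[t]                # recursive shape update
  b <- b / (b + 1)               # recursive scale update for a single new observation
}
c(a, b)                                           # sequential posterior parameters
c(a0 + sum(yEx), b0 / (b0 * length(yEx) + 1))     # batch posterior parameters (identical)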

It is also important to consider the sampling properties of “Bayesian estimators”. This topic has attracted the attention of statisticians and econometricians for a long time. For instance, asymptotic posterior concentration on the population parameter vector is discussed by Bickel and Yahav (1969). The convergence of posterior distributions is stated by the Bernstein-von Mises theorem (Lehmann and Casella 2003; Van der Vaart 2000), which establishes a link between credible intervals (sets) and confidence intervals (sets), where a credible interval is an interval in the domain of the posterior distribution within which an unknown parameter falls with a particular probability. Credible intervals treat bounds as fixed and parameters as random, whereas confidence intervals reverse this. There are many settings in parametric models where Bayesian credible intervals with an \(\alpha\) level converge asymptotically to confidence intervals at the \(\alpha\) level. This suggests that Bayesian inference is asymptotically correct from a sampling perspective in these settings.

A heuristic approach to demonstrate this in the simplest case, where we assume random sampling and \(\theta \in \mathbb{R}\), is the following: \(p(\mathbf{y} \mid \theta) = \prod_{i=1}^N p(y_i \mid \theta)\), so the log likelihood is \(l(\mathbf{y} \mid \theta) \equiv \log p(\mathbf{y} \mid \theta) = \sum_{i=1}^N \log p(y_i \mid \theta) = N \times \bar{l}(\mathbf{y} \mid \theta)\), where \(\bar{l} \equiv \frac{1}{N} \sum_{i=1}^N \log p(y_i \mid \theta)\) is the mean log likelihood.7 Then, the posterior distribution is proportional to

\[\begin{align} \pi(\theta\mid \mathbf{y})&\propto p(\mathbf{y}\mid \theta) \times \pi(\theta)\nonumber\\ &=\exp\left\{N\times \bar{l}(\mathbf{y}\mid \theta)\right\} \times \pi(\theta). \end{align}\]

Observe that as the sample size increases, that is, as \(N \to \infty\), the exponential term should dominate the prior distribution as long as the prior does not depend on \(N\), such that the likelihood determines the posterior distribution asymptotically.

Maximum likelihood theory shows that, as \(N \to \infty\), \(\bar{l}(\mathbf{y} \mid \theta)\) converges to its population counterpart \(\mathbb{E}[\log p(Y \mid \theta)]\), which is maximized at \(\theta_0\), the population parameter of the data-generating process. In addition, performing a second-order Taylor expansion of the log likelihood at the maximum likelihood estimator,

\[\begin{align*} l(\mathbf{y}\mid \theta)&\approx l(\mathbf{y}\mid \hat{\theta})+\left.\frac{dl(\mathbf{y}\mid {\theta})}{d\theta}\right\vert_{\hat{\theta}}(\theta-\hat{\theta})+\frac{1}{2}\left.\frac{d^2l(\mathbf{y}\mid {\theta})}{d\theta^2}\right\vert_{\hat{\theta}}(\theta-\hat{\theta})^2\\ &= l(\mathbf{y}\mid \hat{\theta})+\frac{1}{2}\left.\sum_{i=1}^N\frac{d^2l(y_i\mid {\theta})}{d\theta^2}\right\vert_{\hat{\theta}}(\theta-\hat{\theta})^2\\ &= l(\mathbf{y}\mid \hat{\theta})-\frac{1}{2}\left.N\left[-\bar{l}''\right\vert_{\hat{\theta}}\right](\theta-\hat{\theta})^2\\ &= l(\mathbf{y}\mid \hat{\theta})-\frac{N}{2\sigma^2}(\theta-\hat{\theta})^2 \end{align*}\]

where \(\left.\frac{dl(\mathbf{y}\mid \theta)}{d\theta}\right\vert_{\hat{\theta}}=0\), \(\bar{l}''\equiv\frac{1}{N}\left.\sum_{i=1}^N\frac{d^2l(y_i\mid {\theta})}{d\theta^2}\right\vert_{\hat{\theta}}\) and \(\sigma^2:=\left[\left.-\bar{l}''\right\vert_{\hat{\theta}}\right]^{-1}\).8 Then,

\[\begin{align*} \pi(\theta\mid \mathbf{y})&\propto \exp\left\{{l}(\mathbf{y}\mid \theta)\right\} \times \pi(\theta)\\ &\approx \exp\left\{l(\mathbf{y}\mid \hat{\theta})-\frac{N}{2\sigma^2}(\theta-\hat{\theta})^2\right\} \times \pi(\theta)\\ &\propto \exp\left\{-\frac{N}{2\sigma^2}(\theta-\hat{\theta})^2\right\} \times \pi(\theta)\\ \end{align*}\]

Observe that the posterior density is proportional to the kernel of a normal density with mean \(\hat{\theta}\) and variance \(\sigma^2 / N\), as long as \(\pi(\hat{\theta}) \neq 0\). This kernel dominates as the sample size increases due to the \(N\) in the exponential term. It is also important to note that the prior should not exclude values of \(\theta\) that are logically possible, such as \(\hat{\theta}\).
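A minimal numerical check of this heuristic in a Poisson-Gamma setting (hypothetical data and hyperparameters) compares the exact posterior with the normal approximation \(N(\hat{\theta}, \sigma^2/N)\); for the Poisson log likelihood, \(\sigma^2 = \hat{\theta}\):

set.seed(10101)
N <- 200
yEx <- rpois(N, lambda = 2)               # hypothetical data
a0 <- 1; b0 <- 1                          # hypothetical prior hyperparameters
thetaGrid <- seq(1.5, 2.5, by = 0.001)
exactPost <- dgamma(thetaGrid, shape = a0 + sum(yEx), scale = b0 / (b0 * N + 1))   # exact posterior
thetahat <- mean(yEx)                     # maximum likelihood estimate
normApprox <- dnorm(thetaGrid, mean = thetahat, sd = sqrt(thetahat / N))           # asymptotic approximation
plot(thetaGrid, exactPost, type = "l", xlab = "theta", ylab = "Density")
lines(thetaGrid, normApprox, lty = 2)     # the two curves nearly coincide for large N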

Example: Health insurance

Suppose that you are analyzing whether to buy health insurance next year. To make a better decision, you want to know the probability that you will visit your doctor at least once next year. To answer this question, you have records of the number of times you visited your doctor over the last 5 years, \(\mathbf{y} = \{0, 3, 2, 1, 0\}\). How should you proceed?

Assuming that this is a random sample9 from a data-generating process (statistical model) that is Poisson, i.e., \(Y_i \sim P(\lambda)\), and your probabilistic prior beliefs about \(\lambda\) are well described by a Gamma distribution with shape and scale parameters \(\alpha_0\) and \(\beta_0\), i.e., \(\lambda \sim G(\alpha_0, \beta_0)\), then you are interested in calculating the probability \(P(Y_0 > 0 \mid \mathbf{y})\). To answer this, you need to calculate the posterior predictive density \(\pi(y_0 \mid \mathbf{y})\) in a Bayesian way.

In this example, \(p(\mathbf{y} \mid \lambda)\) is Poisson, and \(\pi(\lambda)\) is Gamma. Therefore, using Equation (1.9),

\[\begin{align*} \pi(y_0\mid \mathbf{y})=&\int_{0}^{\infty}\frac{\lambda^{y_0}\exp\left\{-\lambda\right\}}{y_0!}\times \pi(\lambda\mid \mathbf{y})d\lambda,\\ \end{align*}\]

where the posterior distribution is
\[ \pi(\lambda\mid \mathbf{y})\propto \lambda^{\sum_{i=1}^N y_i + \alpha_0 - 1}\exp\left\{-\lambda\left(\frac{\beta_0 N+1}{\beta_0}\right)\right\} \] by Equation (1.3).

Observe that the last expression is the kernel of a Gamma distribution with parameters \(\alpha_n = \sum_{i=1}^N y_i + \alpha_0\) and \(\beta_n = \frac{\beta_0}{\beta_0 N + 1}\). Given that \(\int_0^{\infty} \pi(\lambda \mid \mathbf{y}) d\lambda = 1\) and the kernel integrates to \(\Gamma(\alpha_n) \beta_n^{\alpha_n}\), where \(\Gamma(\cdot)\) is the Gamma function, the constant of proportionality in the last expression is \(1/\left(\Gamma(\alpha_n) \beta_n^{\alpha_n}\right)\). Thus, the posterior density function \(\pi(\lambda \mid \mathbf{y})\) is \(G(\alpha_n, \beta_n)\).

Observe that

\[\begin{align*} \mathbb{E}[\lambda\mid \mathbf{y}]&=\alpha_n\beta_n\\ &=\left(\sum_{i=1}^N y_i + \alpha_0\right)\left(\frac{\beta_0}{\beta_0 N + 1}\right)\\ &=\bar{y}\left(\frac{N\beta_0}{N\beta_0+1}\right)+\alpha_0\beta_0\left(\frac{1}{N\beta_0+1}\right)\\ &=w\bar{y}+(1-w)\mathbb{E}[\lambda], \end{align*}\]

where \(\bar{y}\) is the sample mean estimate, which is the maximum likelihood estimate of \(\lambda\) in this example, \(w = \left(\frac{N\beta_0}{N\beta_0 + 1}\right)\), and \(\mathbb{E}[\lambda] = \alpha_0 \beta_0\) is the prior mean. The posterior mean is a weighted average of the maximum likelihood estimator (sample information) and the prior mean. Observe that \(\lim_{N \to \infty} w = 1\), that is, the sample information asymptotically dominates.
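A quick numerical check of this decomposition, using the data of the example and hypothetical hyperparameters \(\alpha_0 = 1\) and \(\beta_0 = 1\):

yEx <- c(0, 3, 2, 1, 0); Nex <- length(yEx)
a0 <- 1; b0 <- 1                                  # hypothetical hyperparameters
w <- Nex * b0 / (Nex * b0 + 1)                    # weight on the sample information
w * mean(yEx) + (1 - w) * a0 * b0                 # weighted average of sample and prior means
(a0 + sum(yEx)) * b0 / (b0 * Nex + 1)             # posterior mean: same value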

The predictive distribution is

\[\begin{align*} \pi(y_0\mid \mathbf{y})=&\int_{0}^{\infty}\frac{\lambda^{y_0}\exp\left\{-\lambda\right\}}{y_0!}\times \frac{1}{\Gamma(\alpha_n)\beta_n^{\alpha_n}}\lambda^{\alpha_n-1}\exp\left\{-\lambda/\beta_n\right\} d\lambda\\ =&\frac{1}{y_0!\Gamma(\alpha_n)\beta_n^{\alpha_n}}\int_{0}^{\infty}\lambda^{y_0+\alpha_n-1}\exp\left\{-\lambda\left(\frac{1+\beta_n}{\beta_n}\right)\right\}d\lambda\\ =&\frac{\Gamma(y_0+\alpha_n)\left(\frac{\beta_n}{\beta_n+1}\right)^{y_0+\alpha_n}}{y_0!\Gamma(\alpha_n)\beta_n^{\alpha_n}}\\ =&{y_0+\alpha_n-1 \choose y_0}\left(\frac{\beta_n}{\beta_n+1}\right)^{y_0}\left(\frac{1}{\beta_n+1}\right)^{\alpha_n}. \end{align*}\]

The third equality follows from the kernel of a Gamma density, and the fourth from
\[ {y_0 + \alpha_n - 1 \choose y_0} = \frac{(y_0 + \alpha_n - 1)(y_0 + \alpha_n - 2)\dots\alpha_n}{y_0!} = \frac{\Gamma(y_0 + \alpha_n)}{\Gamma(\alpha_n) y_0!} \] using a property of the Gamma function.

Observe that this is a Negative Binomial density, that is, \(Y_0 \mid \mathbf{y} \sim \text{NB}(\alpha_n, p_n)\) where \(p_n = \frac{\beta_n}{\beta_n + 1}\).

Up to this point, we have said nothing about the hyperparameters, which are required to give a concrete answer in this exercise. Thus, we show two approaches to setting them. First, we set \(\alpha_0 = 0.001\) and \(\beta_0 = \frac{1}{0.001}\), which implies vague prior information about \(\lambda\), since the prior variance is very large relative to the prior mean.10 In particular, \(\mathbb{E}[\lambda] = 1\) and \(Var[\lambda] = 1000\).

In this setting, \(P(Y_0 > 0 \mid \mathbf{y}) = 1 - P(Y_0 = 0 \mid \mathbf{y}) \approx 0.67\). That is, the probability of visiting the doctor at least once next year is approximately 0.67.

Another approach is using Empirical Bayes, where we set the hyperparameters by maximizing the logarithm of the marginal likelihood,11 that is, \[ \left[\hat{\alpha}_0 \ \hat{\beta}_0\right]^{\top} = \underset{\alpha_0, \beta_0}{\mathrm{argmax}} \ \ln p(\mathbf{y}), \] where \[ \begin{align} p(\mathbf{y}) &= \int_0^{\infty} \left\{ \frac{1}{\Gamma(\alpha_0)\beta_0^{\alpha_0}} \lambda^{\alpha_0 - 1} \exp\left\{-\lambda / \beta_0\right\} \prod_{i=1}^N \frac{\lambda^{y_i} \exp\left\{-\lambda\right\}}{ y_i!} \right\} d\lambda \\ &= \frac{\int_0^{\infty} \lambda^{\sum_{i=1}^N y_i + \alpha_0 - 1} \exp\left\{-\lambda \left( \frac{\beta_0 N + 1}{\beta_0} \right) \right\} d\lambda}{ \Gamma(\alpha_0) \beta_0^{\alpha_0} \prod_{i=1}^N y_i! } \\ &= \frac{\Gamma\left(\sum_{i=1}^N y_i + \alpha_0\right) \left( \frac{\beta_0}{N\beta_0 + 1} \right)^{\sum_{i=1}^N y_i} \left( \frac{1}{N\beta_0 + 1} \right)^{\alpha_0}}{ \Gamma(\alpha_0) \prod_{i=1}^N y_i! }. \end{align} \]

Using the empirical Bayes approach, we get \(\hat{\alpha}_0 = 51.8\) and \(\hat{\beta}_0 = 0.023\), then \(P(Y_0 > 0 \mid \mathbf{y}) = 1 - P(Y_0 = 0 \mid \mathbf{y}) \approx 0.70\).

Observe that we can calculate the posterior odds comparing the model using an Empirical Bayes prior (model 1) versus the vague prior (model 2). We assume that \(\pi(\mathcal{M}_1) = \pi(\mathcal{M}_2) = 0.5\), then \[ \begin{align} PO_{12} &= \frac{p(\mathbf{y} \mid \text{Empirical Bayes})}{ p(\mathbf{y} \mid \text{Vague prior}) } \\ &= \frac{\frac{\Gamma\left(\sum_{i=1}^N y_i + 51.807\right) \left( \frac{0.023}{N \times 0.023 + 1} \right)^{\sum_{i=1}^N y_i} \left( \frac{1}{N \times 0.023 + 1} \right)^{51.807}}{\Gamma(51.807)}}{\frac{\Gamma\left(\sum_{i=1}^N y_i + 0.001\right) \left( \frac{1/0.001}{N/0.001 + 1} \right)^{\sum_{i=1}^N y_i} \left( \frac{1}{N/0.001 + 1} \right)^{0.001}}{\Gamma(0.001)}} \\ &\approx 919 \end{align} \]

Then, \(2 \times \log(PO_{12}) = 13.64\), which provides very strong evidence against the vague prior model (see Table 1.1). In particular, \(\pi(\text{Empirical Bayes} \mid \mathbf{y}) = \frac{919}{1 + 919} = 0.999\) and \(\pi(\text{Vague prior} \mid \mathbf{y}) = 1 - 0.999 = 0.001\). These probabilities can be used to perform Bayesian model averaging (BMA). In particular, \[ \begin{align} \mathbb{E}(\lambda \mid \mathbf{y}) &= 1.2 \times 0.999 + 1.2 \times 0.001 = 1.2 \\ \text{Var}(\lambda \mid \mathbf{y}) &= 0.025 \times 0.999 + 0.24 \times 0.001 \\ &+ (1.2 - 1.2)^2 \times 0.999 + (1.2 - 1.2)^2 \times 0.001 = 0.025 \end{align} \]

The BMA predictive distribution is a mix of negative binomial distributions, that is, \[ Y_0 \mid \mathbf{y} \sim 0.999 \times \text{NB}(57.8, 0.02) + 0.001 \times \text{NB}(6.001, 0.17) \]

The following code shows how to perform this exercise in R.

set.seed(010101)
y <- c(0, 3, 2, 1, 0) # Data
N <- length(y)
ProbBo <- function(y, a0, b0){
    N <- length(y)
    #sample size
    an <- a0 + sum(y) 
    # Posterior shape parameter
    bn <- b0 / ((b0 * N) + 1) 
    # Posterior scale parameter
    p <- bn / (bn + 1) 
    # Probability negative binomial density
    Pr <- 1 - pnbinom(0, size=an,prob=(1 - p)) 
    # Probability of visiting the Doctor at least once next year
    # Observe that in R there is a slightly different parametrization.
    return(Pr)
} 
# Using a vague prior:
a0 <- 0.001 # Prior shape parameter
b0 <- 1 / 0.001 # Prior scale parameter
PriMeanV <- a0 * b0 # Prior mean
PriVarV <- a0 * b0^2 # Prior variance
Pp <- ProbBo(y, a0 = 0.001, b0 = 1 / 0.001) 
# This setting is vague prior information.
Pp
## [1] 0.6650961
# Using Empirical Bayes
LogMgLik <- function(theta, y){
 N <- length(y) 
 #sample size
 a0 <- theta[1] 
 # prior shape hyperparameter
 b0 <- theta[2] 
 # prior scale hyperparameter
 an <- sum(y) + a0 
 # posterior shape parameter
 if(a0 <= 0 || b0 <= 0){ 
  #Avoiding negative values
  lnp <- -Inf
  }else{
  lnp <- lgamma(an) + sum(y)*log(b0/(N*b0+1)) - a0*log(N*b0+1) - lgamma(a0)
 } 
 # log marginal likelihood
 return(-lnp)
}           
theta0 <- c(0.01, 1/0.1) 
# Initial values
control <- list(maxit = 1000) 
# Number of iterations in optimization
EmpBay <- optim(theta0, LogMgLik, method = "BFGS", control = control, hessian = TRUE, y = y) 
# Optimization
EmpBay$convergence
## [1] 0
a0EB <- EmpBay$par[1] 
# Prior shape using empirical Bayes
a0EB
## [1] 51.80696
b0EB <- EmpBay$par[2] 
# Prior scale using empirical Bayes
b0EB
## [1] 0.02318341
PriMeanEB <- a0EB * b0EB 
# Prior mean
PriVarEB <- a0EB * b0EB^2 
# Prior variance
PpEB <- ProbBo(y, a0 = a0EB, b0 = b0EB) 
# This setting is using empirical Bayes.
PpEB
## [1] 0.6953668
# Density figures: 
# This code helps plotting densities
lambda <- seq(0.01, 10, 0.01) 
# Values of lambda
VaguePrior <- dgamma(lambda,shape=a0,scale = b0)
EBPrior <- dgamma(lambda,shape=a0EB,scale = b0EB)
PosteriorV <- dgamma(lambda, shape = a0 + sum(y), scale = b0 / ((b0 * N) + 1)) 
PosteriorEB <- dgamma(lambda, shape = a0EB+sum(y), scale = b0EB / ((b0EB * N) + 1))         
# Likelihood function
Likelihood <- function(theta, y){
 LogL <- dpois(y, theta, log = TRUE)
 Lik <- prod(exp(LogL))
 return(Lik)
}
Liks <- sapply(lambda, function(par) {Likelihood(par, y = y)})
Sc <- max(PosteriorEB)/max(Liks) 
#Scale for displaying in figure
LiksScale <- Liks * Sc
data <- data.frame(cbind(lambda, VaguePrior, EBPrior, PosteriorV, PosteriorEB, LiksScale)) #Data frame
require(ggplot2) # Cool figures
## Loading required package: ggplot2
require(latex2exp) # LaTeX equations in figures
## Loading required package: latex2exp
require(ggpubr) # Multiple figures in one page
## Loading required package: ggpubr
fig1 <- ggplot(data = data, aes(lambda, VaguePrior)) + geom_line() +    xlab(TeX("$\\lambda$")) + ylab("Density") + ggtitle("Prior: Vague Gamma") 
fig2 <- ggplot(data = data, aes(lambda, EBPrior)) + geom_line() + xlab(TeX("$\\lambda$")) + ylab("Density") + ggtitle("Prior: Empirical Bayes Gamma")
fig3 <- ggplot(data = data, aes(lambda, PosteriorV)) + geom_line() + xlab(TeX("$\\lambda$")) + ylab("Density") + ggtitle("Posterior: Vague Gamma")
fig4 <- ggplot(data = data, aes(lambda, PosteriorEB)) + geom_line() + xlab(TeX("$\\lambda$")) + ylab("Density") + ggtitle("Posterior: Empirical Bayes Gamma")
FIG <- ggarrange(fig1, fig2, fig3, fig4, ncol = 2, nrow = 2)
annotate_figure(FIG, top = text_grob("Vague versus Empirical Bayes: Poisson-Gamma model", color = "black", face = "bold", size = 14))

dataNew <- data.frame(cbind(rep(lambda, 3), c(EBPrior, PosteriorEB, LiksScale), rep(1:3, each = 1000))) #Data frame
colnames(dataNew) <- c("Lambda", "Density", "Factor")
dataNew$Factor <- factor(dataNew$Factor, levels=c("1", "3", "2"), 
labels=c("Prior", "Likelihood", "Posterior"))

ggplot(data = dataNew, aes(x = Lambda, y = Density, group = Factor, color = Factor)) +
  geom_line() +
  xlab(TeX("$\\lambda$")) +
  ylab("Density") +
  ggtitle("Prior, likelihood and posterior: Empirical Bayes Poisson-Gamma model") +
  guides(color = guide_legend(title = "Information")) +
  scale_color_manual(values = c("red", "yellow", "blue"))

The first figure displays the prior and posterior densities based on vague and Empirical Bayes hyperparameters. We observe that the prior and posterior densities using the latter are more informative, as expected.

The second figure shows the prior, scaled likelihood, and posterior densities of \(\lambda\) based on the hyperparameters from the Empirical Bayes approach. The posterior density is a compromise between prior and sample information.

# Predictive distributions
PredDen <- function(y, y0, a0, b0){
    N <- length(y)
    #sample size
    an <- a0 + sum(y) 
    # Posterior shape parameter
    bn <- b0 / ((b0 * N) + 1) 
    # Posterior scale parameter
    p <- bn / (bn + 1) 
    # Probability negative binomial density
    Pr <- dnbinom(y0, size=an, prob=(1 - p))
    # Predictive density
    # Observe that in R there is a slightly different parametrization.
    return(Pr)
}
y0 <- 0:10
PredVague <- PredDen(y=y, y0=y0, a0=a0, b0=b0)
PredEB <- PredDen(y=y, y0=y0, a0=a0EB, b0=b0EB)
dataPred <- as.data.frame(cbind(y0, PredVague, PredEB))
colnames(dataPred) <- c("y0", "PredictiveVague", "PredictiveEB")
ggplot(data = dataPred) + geom_point(aes(y0, PredictiveVague, color = "red")) +  
xlab(TeX("$y_0$")) + ylab("Density") + ggtitle("Predictive density: Vague and Empirical Bayes priors") + geom_point(aes(y0, PredictiveEB, color = "yellow")) +
guides(color = guide_legend(title="Prior")) + scale_color_manual(labels = c("Vague", "Empirical Bayes"), values = c("red", "yellow")) + scale_x_continuous(breaks=seq(0,10,by=1))

This figure displays the predictive probability mass of not having any visits to a physician next year, as well as having one, two, and so on, using Empirical Bayes and vague hyperparameters. The predictive probabilities of not having any visits are approximately 30% and 33% based on the Empirical Bayes and vague hyperparameters, respectively.

# Posterior odds: Vague vs Empirical Bayes
PO12 <- exp(-LogMgLik(c(a0EB, b0EB), y = y))/exp(-LogMgLik(c(a0, b0), y = y))
PO12
## [1] 919.0069
PostProMEM <- PO12/(1 + PO12) 
PostProMEM
## [1] 0.9989131
# Posterior model probability Empirical Bayes
PostProbMV <- 1 - PostProMEM 
PostProbMV
## [1] 0.001086948
# Posterior model probability vague prior
# Bayesian model average (BMA)
PostMeanEB <- (a0EB + sum(y)) * (b0EB / (b0EB * N + 1)) 
# Posterior mean Empirical Bayes 
PostMeanV <- (a0 + sum(y)) * (b0 / (b0 * N + 1)) 
# Posterior mean vague priors
BMAmean <- PostProMEM * PostMeanEB + PostProbMV * PostMeanV  
BMAmean
## [1] 1.200951
# BMA posterior mean
PostVarEB <- (a0EB + sum(y)) * (b0EB/(b0EB * N + 1))^2 
# Posterior variance Empirical Bayes
PostVarV <- (a0 + sum(y)) * (b0 / (b0 * N + 1))^2 
# Posterior variance vague prior 
BMAVar <- PostProMEM * PostVarEB + PostProbMV*PostVarV + PostProMEM * (PostMeanEB - BMAmean)^2 + PostProbMV * (PostMeanV - BMAmean)^2
# BMA posterior variance   
BMAVar
## [1] 0.02518372
# BMA: Predictive
BMAPred <- PostProMEM * PredEB+PostProbMV * PredVague
dataPredBMA <- as.data.frame(cbind(y0, BMAPred))
colnames(dataPredBMA) <- c("y0", "PredictiveBMA")
ggplot(data = dataPredBMA) + geom_point(aes(y0, PredictiveBMA, color = "red")) +  xlab(TeX("$y_0$")) + ylab("Density") + ggtitle("Predictive density: BMA") + guides(color = guide_legend(title="BMA")) + scale_color_manual(labels = c("Probability"), values = c("red")) + scale_x_continuous(breaks=seq(0,10,by=1)) 

The first figure displays the predictive density using Bayesian model averaging based on the vague and Empirical Bayes hyperparameters. This figure closely resembles the predictive probability mass function based on the Empirical Bayes framework, as the posterior model probability for that setting is nearly one.

The second figure shows how the posterior distribution updates with new sample information, starting from an initial non-informative prior (iteration 1). We observe that iteration 5 incorporates all the sample information in our example. As a result, the posterior density in iteration 5 is identical to the posterior density obtained using the whole sample at once.
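The code for this recursive-updating figure is not shown above. A minimal sketch that could generate it, reusing the objects from the example (the grid lambda, the data y, and N) and taking the vague prior \(G(0.001, 1/0.001)\) as the non-informative starting point, is the following; here the curve labeled \(t\) is the posterior after the first \(t\) observations.

aRec <- 0.001; bRec <- 1 / 0.001            # vague prior as the starting point
DensRec <- dgamma(lambda, shape = aRec, scale = bRec)
Stage <- rep("Prior", length(lambda))
for (t in 1:N) {
  aRec <- aRec + y[t]                       # recursive shape update with observation t
  bRec <- bRec / (bRec + 1)                 # recursive scale update with observation t
  DensRec <- c(DensRec, dgamma(lambda, shape = aRec, scale = bRec))
  Stage <- c(Stage, rep(paste("t =", t), length(lambda)))
}
dataRec <- data.frame(Lambda = rep(lambda, N + 1), Density = DensRec,
                      Stage = factor(Stage, levels = unique(Stage)))
ggplot(data = dataRec, aes(x = Lambda, y = Density, color = Stage)) + geom_line() +
  xlab(TeX("$\\lambda$")) + ylab("Density") +
  ggtitle("Recursive updating: Poisson-Gamma model")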

References

Bayarri, M., and J. Berger. 2000. “P-Values for Composite Null Models.” Journal of the American Statistical Association 95: 1127–42.
Bickel, Peter J, and Joseph A Yahav. 1969. “Some Contributions to the Asymptotic Theory of Bayes Solutions.” Zeitschrift für Wahrscheinlichkeitstheorie Und Verwandte Gebiete 11 (4): 257–76.
Casella, George, and Roger Berger. 2024. Statistical Inference. CRC Press.
Gelman, A., and X. Meng. 1996. “Model Checking and Model Improvement.” In Markov Chain Monte Carlo in Practice, edited by Gilks, Richardson, and Spiegelhalter. Springer US.
Gelman, A., X. Meng, and H. Stern. 1996. “Posterior Predictive Assessment of Model Fitness via Realized Discrepancies.” Statistica Sinica, 733–60.
Gelman, Andrew et al. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3): 515–34.
Kass, R. E., and A. E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95.
Lehmann, E. L., and George Casella. 2003. Theory of Point Estimation. Second Edition. Springer.
Petris, Giovanni, Sonia Petrone, and Patrizia Campagnoli. 2009. “Dynamic Linear Models.” In Dynamic Linear Models with r, 31–84. Springer.
Van der Vaart, Aad W. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. MIT Press.

  5. From a Bayesian perspective, \(\boldsymbol{\theta}\) is fixed but unknown. Then, it is treated as a random object despite the lack of variability (see Chapter 2).

  6. M. Bayarri and Berger (2000) show potential issues due to using data twice in the construction of the predictive p-values. They also present alternative proposals, for instance, the partial posterior predictive p-value.

  7. Note that in the likelihood function the argument is \(\theta\), but we keep the notation for convenience in exposition.

  8. The last definition follows from standard theory in maximum likelihood estimation (see Casella and Berger (2024) and Jeffrey M. Wooldridge (2010)).

  9. Independent and identically distributed draws.

  10. We should be aware that there may be technical problems using this kind of hyperparameters in this setting (Andrew Gelman et al. 2006).

  11. Empirical Bayes methods are criticized due to double-using the data: first to set the hyperparameters, and second to perform Bayesian inference.