1.3 Bayesian reports: Decision theory under uncertainty

The Bayesian framework allows reporting the full posterior distribution. However, some situations require reporting a single value from the posterior distribution (a point estimate), an informative interval (set), point or interval predictions, and/or a specific model. Decision theory offers an elegant framework for deciding which posterior summaries are optimal to report (J. O. Berger 2013).

The starting point is a loss function, which is a non-negative real-valued function whose arguments are the unknown state of nature (\(\mathbf{\Theta}\)), and a set of actions to be taken (\(\mathcal{A}\)), that is, \[\begin{equation*} L(\mathbf{\theta}, a):\mathbf{\Theta}\times \mathcal{A}\rightarrow \mathcal{R}^+. \end{equation*}\]

This function is a mathematical representation of the loss incurred from making mistakes; in particular, the loss from selecting action \(a\in\mathcal{A}\) when \(\mathbf{\theta}\in\mathbf{\Theta}\) is the true state of nature. In our case, the unknown state of nature can refer to parameters, functions of parameters, future or unobserved realizations, models, etc.

From a Bayesian perspective, we should choose the action that minimizes the posterior expected loss (\(a^*(\mathbf{y})\)), that is, the posterior risk function (\(\mathbb{E}[L(\mathbf{\theta}, a)\mid \mathbf{y}]\)), \[\begin{equation*} a^*(\mathbf{y})=\underset{a \in \mathcal{A}}{\mathrm{argmin}} \ \mathbb{E}[L(\mathbf{\theta}, a)\mid \mathbf{y}], \end{equation*}\] where \(\mathbb{E}[L(\mathbf{\theta}, a)\mid \mathbf{y}] = \int_{\mathbf{\Theta}} L(\mathbf{\theta}, a)\pi(\mathbf{\theta}\mid \mathbf{y})d\mathbf{\theta}\).12
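
To make this recipe concrete, the posterior expected loss can be approximated by Monte Carlo from posterior draws and minimized numerically. The following R sketch is illustrative only: theta_draws stands in for draws from \(\pi(\theta \mid \mathbf{y})\), and the quadratic loss can be replaced by any non-negative loss function.

# Sketch: approximate the posterior risk from posterior draws and minimize it numerically
theta_draws <- rgamma(10000, shape = 2, scale = 0.5) # placeholder posterior draws
Loss <- function(theta, a) (theta - a)^2 # any non-negative loss function
PostRisk <- function(a) mean(Loss(theta_draws, a)) # Monte Carlo estimate of E[L(theta, a) | y]
aStar <- optimize(PostRisk, interval = range(theta_draws))$minimum # optimal action
aStar # close to mean(theta_draws) under the quadratic loss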

Different loss functions imply different optimal decisions. We illustrate this assuming \(\theta \in \mathcal{R}\).

  • The quadratic loss function, \(L(\theta,a)=[\theta-a]^2\), gives as the optimal decision the posterior mean, \(a^*(\mathbf{y})=\mathbb{E}[\theta \mid \mathbf{y}]\), that is:

\[\begin{equation*} \mathbb{E}[\theta \mid \mathbf{y}] = \underset{a \in \mathcal{A}}{\mathrm{argmin}} \ \int_{\Theta} [\theta - a]^2 \pi(\theta \mid \mathbf{y}) \, d\theta. \end{equation*}\]

To obtain this result, use the first-order condition: differentiate the risk function with respect to \(a\), interchanging differentiation and integration, and set the result equal to zero:

\[ -2 \int_{\Theta} [\theta - a^*] \pi(\theta \mid \mathbf{y}) \, d\theta = 0. \]

This implies that

\[ a^* \int_{\Theta} \pi(\theta \mid \mathbf{y}) \, d\theta = a^*(\mathbf{y}) = \int_{\Theta} \theta \pi(\theta \mid \mathbf{y}) \, d\theta = \mathbb{E}[\theta \mid \mathbf{y}], \]

that is, the posterior mean is the Bayesian optimal action. This means that we should report the posterior mean as a point estimate of \(\theta\) when facing the quadratic loss function.

  • The generalized quadratic loss function, \(L(\theta,a) = w(\theta) [\theta - a]^2\), where \(w(\theta) > 0\) is a weighting function, gives as the optimal decision a weighted posterior mean. Following the same steps as in the previous result, we obtain

\[ a^*(\mathbf{y}) = \frac{\mathbb{E}[w(\theta) \times \theta \mid \mathbf{y}]}{\mathbb{E}[w(\theta) \mid \mathbf{y}]}. \]

Observe that the weighted average is driven by the weighting function \(w(\theta)\).

  • The absolute error loss function, \(L(\theta,a) = |\theta - a|\), gives as the optimal action the posterior median (Exercise 5).

  • The generalized absolute error function,

\[ L(\theta,a) = \begin{cases} K_0 (\theta - a), & \text{if } \theta - a \geq 0, \\ K_1 (a - \theta), & \text{if } \theta - a < 0, \end{cases} \quad K_0, K_1 > 0, \]

implies the following risk function:

\[\begin{align*} \mathbb{E}[L(\theta, a) \mid \mathbf{y}] &= \int_{-\infty}^{a} K_1(a - \theta) \pi(\theta \mid \mathbf{y}) \, d\theta + \int_{a}^{\infty} K_0 (\theta - a) \pi(\theta \mid \mathbf{y}) \, d\theta. \end{align*}\]

Differentiating with respect to \(a\), interchanging differentiation and integration, and equating to zero, we get:

\[\begin{align*} K_1 \int_{-\infty}^{a^*} \pi(\theta \mid \mathbf{y}) \, d\theta - K_0 \int_{a^*}^{\infty} \pi(\theta \mid \mathbf{y}) \, d\theta &= 0. \end{align*}\]

Thus, we have

\[ \int_{-\infty}^{a^*} \pi(\theta \mid \mathbf{y}) \, d\theta = \frac{K_0}{K_0 + K_1}, \]

that is, any \(\frac{K_0}{K_0 + K_1}\) quantile of \(\pi(\theta \mid \mathbf{y})\) is an optimal Bayesian estimate of \(\theta\).
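
These optimal actions are easy to verify numerically. The following sketch uses simulated draws from an arbitrary posterior (the Gamma density and the weighting function are placeholders) and reports the optimal point estimates under the four loss functions above.

# Numerical check of the optimal actions under the four loss functions (placeholder posterior draws)
set.seed(10)
draws <- rgamma(100000, shape = 3, scale = 0.4) # stand-in posterior draws
w <- function(theta) theta^2 # illustrative weighting function
K0 <- 2; K1 <- 1
mean(draws) # optimal action under the quadratic loss
sum(w(draws)*draws)/sum(w(draws)) # optimal action under the generalized quadratic loss
median(draws) # optimal action under the absolute loss
quantile(draws, K0/(K0 + K1)) # optimal action under the generalized absolute loss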

We can also use decision theory under uncertainty in hypothesis testing. In particular, when testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_1\), where \(\Theta = \Theta_0 \cup \Theta_1\) and \(\Theta_0 \cap \Theta_1 = \emptyset\), there are two actions of interest, \(a_0\) and \(a_1\), where \(a_j\) denotes not rejecting \(H_j\), for \(j \in \{0,1\}\).

Consider the \(0\)-\(K_j\) loss function:

\[\begin{equation*} L(\theta,a_j) = \begin{cases} 0, & \text{if } \theta \in \Theta_j, \\ K_j, & \text{if } \theta \in \Theta_i, j \neq i, \end{cases} \end{equation*}\]

There is no loss if the right decision is made, for instance, not rejecting \(H_0\) when \(\theta \in \Theta_0\), whereas the loss is \(K_j\) when an error is made. For example, a type I error occurs when the null hypothesis (\(H_0\)) is rejected although it is true (\(\theta \in \Theta_0\)); this corresponds to choosing action \(a_1\) (not rejecting \(H_1\)) and results in a loss of \(K_1\).

The posterior expected loss associated with decision \(a_j\), i.e., not rejecting \(H_j\), is:

\[ \mathbb{E}[L(\theta,a_j) \mid \mathbf{y}] = 0 \times P(\Theta_j \mid \mathbf{y}) + K_j P(\Theta_i \mid \mathbf{y}) = K_j P(\Theta_i \mid \mathbf{y}), \quad j \neq i. \]

Therefore, the Bayes optimal decision is the one that minimizes the posterior expected loss. That is, the null hypothesis is rejected (action \(a_1\) is taken, i.e., \(H_1\) is not rejected) when

\[ K_0 P(\Theta_1 \mid \mathbf{y}) > K_1 P(\Theta_0 \mid \mathbf{y}). \]

Given our framework, \(\Theta = \Theta_0 \cup \Theta_1\) and \(\emptyset = \Theta_0 \cap \Theta_1\), we have \(P(\Theta_0 \mid \mathbf{y}) = 1 - P(\Theta_1 \mid \mathbf{y})\). As a result, the rejection region of the Bayesian test is:

\[ R = \left\{ \mathbf{y} : P(\Theta_1 \mid \mathbf{y}) > \frac{K_1}{K_1 + K_0} \right\}. \]
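
The rejection rule can be coded in one line; the following sketch uses an illustrative helper name and takes as inputs the posterior probability of \(\Theta_1\) and the two losses.

# Sketch of the Bayesian test: reject H0 when P(Theta_1 | y) > K1/(K0 + K1)
RejectH0 <- function(ProbH1, K0, K1) ProbH1 > K1/(K0 + K1)
RejectH0(ProbH1 = 0.90, K0 = 1, K1 = 1) # TRUE: reject H0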

Decision theory also helps to construct interval (region) estimates. Let \(\Theta_{C(\mathbf{y})} \subset \Theta\) be a credible set for \(\theta\), and let the loss function be defined as:

\[ L(\theta, \Theta_{C(\mathbf{y})}) = 1 - \mathbf{1}\left\{\theta \in \Theta_{C(\mathbf{y})}\right\}, \]

where

\[ \mathbf{1}\left\{\theta \in \Theta_{C(\mathbf{y})}\right\} = \begin{cases} 1, & \text{if } \theta \in \Theta_{C(\mathbf{y})}, \\ 0, & \text{if } \theta \notin \Theta_{C(\mathbf{y})}. \end{cases} \]

Thus, the loss function becomes:

\[ L(\theta, \Theta_{C(\mathbf{y})}) = \begin{cases} 0, & \text{if } \theta \in \Theta_{C(\mathbf{y})}, \\ 1, & \text{if } \theta \notin \Theta_{C(\mathbf{y})}. \end{cases} \]

This is a 0-1 loss function, which equals zero when \(\theta \in \Theta_{C(\mathbf{y})}\) and equals one when \(\theta \notin \Theta_{C(\mathbf{y})}\). Consequently, the risk function is:

\[ 1 - P(\theta \in \Theta_{C(\mathbf{y})}). \]

Given a measure of credibility \(\alpha(\mathbf{y})\) that defines the level of trust that \(\theta \in \Theta_{C(\mathbf{y})}\), we can measure the accuracy of the report by the loss function:

\[ L(\theta, \alpha(\mathbf{y})) = \left[\mathbf{1}\left\{\theta \in \Theta_{C(\mathbf{y})}\right\} - \alpha(\mathbf{y})\right]^2. \]

This loss function could be used to suggest a choice of the report \(\alpha(\mathbf{y})\). Given that this is a quadratic loss function, the optimal action is the posterior mean, that is,

\[ \mathbb{E}[\mathbf{1}\left\{\theta \in \Theta_{C(\mathbf{y})}\right\} \mid \mathbf{y}] = P(\theta \in \Theta_{C(\mathbf{y})} \mid \mathbf{y}). \]

This probability can be calculated given the posterior distribution as

\[ P(\theta \in \Theta_{C(\mathbf{y})} \mid \mathbf{y}) = \int_{\Theta_{C(\mathbf{y})}} \pi(\theta \mid \mathbf{y}) \, d\theta. \]

This represents a measure of the belief that \(\theta \in \Theta_{C(\mathbf{y})}\) given the prior beliefs and sample information.

The set \(\Theta_{C(\mathbf{y})} \subset \Theta\) is a \(100(1 - \alpha)\%\) credible set with respect to \(\pi(\theta \mid \mathbf{y})\) if

\[ P(\theta \in \Theta_{C(\mathbf{y})} \mid \mathbf{y}) = \int_{\Theta_{C(\mathbf{y})}} \pi(\theta \mid \mathbf{y}) \, d\theta = 1 - \alpha. \]

Two alternatives for reporting credible sets are the symmetric credible set and the highest posterior density (HPD) set. The former is based on the \(\frac{\alpha}{2}\) and \(1 - \frac{\alpha}{2}\) quantiles of the posterior distribution, while the latter is a \(100(1 - \alpha)\%\) credible set for \(\theta\) with the property that it has the smallest size (length) among all \(100(1 - \alpha)\%\) credible sets for \(\theta\) based on the posterior distribution. Specifically,

\[ C(\mathbf{y}) = \left\{ \theta : \pi(\theta \mid \mathbf{y}) \geq k(\alpha) \right\}, \]

where \(k(\alpha)\) is the largest number such that

\[ \int_{\theta : \pi(\theta \mid \mathbf{y}) \geq k(\alpha)} \pi(\theta \mid \mathbf{y}) d\theta = 1 - \alpha. \]

The HPD set can be a collection of disjoint intervals when working with multimodal posterior densities. Additionally, HPD sets have the limitation of not necessarily being invariant under transformations.
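
Both reports are easy to approximate from posterior draws. In the sketch below the draws are placeholders: the symmetric set takes the \(\alpha/2\) and \(1-\alpha/2\) quantiles, while the HPD interval is obtained as the shortest interval containing \(100(1-\alpha)\%\) of the sorted draws, which is valid for unimodal posteriors and is essentially what the HDInterval package used in the example below does.

# Sketch: symmetric and HPD 95% credible intervals from placeholder posterior draws
alpha <- 0.05
draws <- sort(rgamma(100000, shape = 3, scale = 0.4)) # placeholder posterior draws, sorted
quantile(draws, c(alpha/2, 1 - alpha/2)) # symmetric credible interval
n <- length(draws)
keep <- ceiling((1 - alpha)*n) # number of draws each candidate interval must contain
widths <- draws[keep:n] - draws[1:(n - keep + 1)] # widths of all intervals containing keep draws
j <- which.min(widths) # position of the shortest such interval
c(draws[j], draws[j + keep - 1]) # HPD interval (unimodal case)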

Decision theory can also be used to perform prediction (point, set, or probabilistic). Suppose that there is a loss function \(L(Y_0, a)\) involving the prediction of \(Y_0\). Then, the expected loss is

\[ \mathbb{E}_{Y_0}[L(Y_0, a)] = \int_{\mathcal{Y}_0} L(y_0, a) \pi(y_0 \mid \mathbf{y}) \, dy_0, \]

where \(\pi(y_0 \mid \mathbf{y})\) is the predictive density function. Thus, we make an optimal choice for prediction that minimizes the risk function given a specific loss function.
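
For instance, under a quadratic loss the optimal point prediction is the mean of the predictive distribution, and under an absolute loss it is its median; both can be approximated from predictive draws. The sketch below builds placeholder predictive draws from a Poisson-Gamma mixture purely for illustration.

# Sketch: optimal point predictions from draws of the predictive distribution
y0_draws <- rpois(100000, lambda = rgamma(100000, shape = 3, scale = 0.4)) # placeholder predictive draws
mean(y0_draws) # optimal point prediction under the quadratic loss
median(y0_draws) # optimal point prediction under the absolute loss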

Although Bayesian Model Averaging (BMA) allows for incorporating model uncertainty in a regression framework, sometimes it is desirable to select just one model. A compelling alternative is to choose the model with the highest posterior model probability. This model is the best alternative for prediction in the case of a 0-1 loss function (Clyde and George 2004).
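
In conjugate settings, such as the Poisson-Gamma model of the following example, the marginal likelihood of each competing prior is available in closed form, so posterior model probabilities can be computed directly. The sketch below assumes equal prior model probabilities; the data vector and the two sets of hyperparameters are illustrative placeholders, not the values used in the example.

# Sketch: posterior model probabilities for two Gamma priors in a Poisson model
LogMarLik <- function(y, a0, b0) { # log marginal likelihood with a Gamma(a0, scale = b0) prior
  N <- length(y); s <- sum(y)
  lgamma(a0 + s) - lgamma(a0) - sum(lfactorial(y)) - a0*log(b0) - (a0 + s)*log(N + 1/b0)
}
ySim <- rpois(50, lambda = 1.2) # placeholder data
a01 <- 2; b01 <- 0.6; a02 <- 0.001; b02 <- 1000 # illustrative hyperparameters of the two priors
logml <- c(LogMarLik(ySim, a01, b01), LogMarLik(ySim, a02, b02))
PMP <- exp(logml - max(logml)); PMP/sum(PMP) # posterior model probabilities (equal prior model probabilities)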

Example: Health insurance (continued)

We show some optimal rules in the health insurance example, specifically the best point estimates of \(\lambda\) under the quadratic, absolute, and generalized absolute loss functions. For the generalized absolute loss function, we assume that underestimating \(\lambda\) is twice as costly as overestimating it, i.e., \(K_0 = 2\) and \(K_1 = 1\).

Given that the posterior distribution of \(\lambda\) is \(G(\alpha_n, \beta_n)\), where \(\alpha_n = \alpha_0 + \sum_{i=1}^N y_i\) and \(\beta_n = \frac{\beta_0}{N\beta_0 + 1}\), and using the hyperparameters from empirical Bayes, we obtain the following optimal point estimates:

  • The posterior mean: \(\mathbb{E}[\lambda \mid \mathbf{y}] = \alpha_n \beta_n = 1.2\),
  • The posterior median: 1.19,
  • The \(2/3\) quantile (since \(K_0/(K_0 + K_1) = 2/3\)): 1.26.

These are the optimal point estimates for the quadratic, absolute, and generalized absolute loss functions, respectively.

In addition, we test the null hypothesis \(H_0: \lambda \in [0, 1)\) versus the alternative hypothesis \(H_1: \lambda \in [1, \infty)\), setting \(K_0 = K_1 = 1\). We should reject the null hypothesis since \(P(\lambda \in [1, \infty) \mid \mathbf{y}) = 0.90 > \frac{K_1}{K_0 + K_1} = 0.5\).

The 95% symmetric credible interval is \((0.91, 1.53)\), and the highest posterior density (HPD) interval is \((0.90, 1.52)\). Finally, the optimal point prediction under the quadratic loss function is 1.2, which is the mean of the posterior predictive distribution. The optimal model, assuming a 0-1 loss function, is the model using the hyperparameters from the empirical Bayes procedure, since the posterior model probability of this model is approximately 1, whereas the posterior model probability of the model using vague hyperparameters is approximately 0.

# y, N, a0EB and b0EB come from the empirical Bayes step of the previous sections
an <- sum(y) + a0EB # Posterior shape parameter
bn <- b0EB / (N*b0EB + 1) # Posterior scale parameter
S <- 1000000 # Number of posterior draws
Draws <- rgamma(S, shape = an, scale = bn) # Posterior draws
###### Point estimation ########
OptQua <- an*bn # Mean: optimal choice under the quadratic loss function
OptQua
## [1] 1.200952
OptAbs <- qgamma(0.5, shape = an, scale = bn) # Median: optimal choice under the absolute loss function
OptAbs
## [1] 1.194034
# Setting K0 = 2 and K1 = 1, that is, underestimating lambda is twice as costly as overestimating it
K0 <- 2; K1 <- 1
OptGenAbs <- quantile(Draws, K0/(K0 + K1)) # K0/(K0 + K1) quantile: optimal choice under the generalized absolute loss function
OptGenAbs
## 66.66667% 
##  1.263182
###### Hypothesis test ########
# H0: lambda in [0, 1) vs H1: lambda in [1, Inf)
K0 <- 1; K1 <- 1
ProbH0 <- pgamma(1, shape = an, scale = bn) # Posterior probability of H0
ProbH0
## [1] 0.09569011
ProbH1 <- 1 - ProbH0 # Posterior probability of H1
ProbH1
## [1] 0.9043099
# We should reject H0 given that ProbH1 > K1 / (K0 + K1)
###### Credible intervals ########
LimInf <- qgamma(0.025, shape = an, scale = bn) # Lower bound of the symmetric credible interval
LimInf
## [1] 0.9114851
LimSup <- qgamma(0.975, shape = an, scale = bn) # Upper bound of the symmetric credible interval
LimSup
## [1] 1.529724
HDI <- HDInterval::hdi(Draws, credMass = 0.95) # Highest posterior density credible interval
HDI
##     lower     upper 
## 0.9007934 1.5163109 
## attr(,"credMass")
## [1] 0.95
###### Predictive optimal choices ########
p <- bn / (bn + 1) # Probability parameter of the negative binomial predictive density
OptPred <- p/(1-p)*an # Optimal point prediction under a quadratic loss function in prediction
OptPred
## [1] 1.200952

References

Berger, James O. 2013. Statistical Decision Theory and Bayesian Analysis. Springer Science & Business Media.
Chernozhukov, V., and H. Hong. 2003. “An MCMC Approach to Classical Estimation.” Journal of Econometrics 115: 293–346.
Clyde, M., and E. George. 2004. “Model Uncertainty.” Statistical Science 19 (1): 81–94.

  12. Chernozhukov and Hong (2003) propose Laplace-type estimators (LTE) based on the quasi-posterior, \(p(\mathbf{\theta})=\frac{\exp\left\{L_n(\mathbf{\theta})\right\}\pi(\mathbf{\theta})}{\int_{\mathbf{\Theta}}\exp\left\{L_n(\mathbf{\theta})\right\}\pi(\mathbf{\theta})d\mathbf{\theta}}\), where \(L_n(\mathbf{\theta})\) is not necessarily a log-likelihood function. The LTE minimizes the quasi-posterior risk.↩︎