2.3 Estimation, hypothesis testing and prediction

All that is required to perform estimation, hypothesis testing (model selection), and prediction in the Bayesian approach is to apply Bayes’ rule. This ensures coherence under a probabilistic view. However, there is no free lunch: coherence reduces flexibility. The Frequentist approach, on the other hand, may not be coherent from a probabilistic point of view, but it is highly flexible. It can be seen as a toolkit that offers inferential solutions under the umbrella of understanding probability as relative frequency. For instance, a Frequentist point estimator is chosen to satisfy desirable sampling properties such as unbiasedness and efficiency, or large-sample properties such as consistency.
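To make these sampling properties concrete, the following minimal sketch (an illustration on simulated data, not taken from the text) checks unbiasedness and consistency of the sample mean:

```python
# Minimal sketch (illustrative): unbiasedness and consistency of the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5                     # true parameters, assumed for illustration

# Unbiasedness: the estimator's average over repeated samples is close to mu.
means = [rng.normal(mu, sigma, size=50).mean() for _ in range(10_000)]
print(f"Average of mu_hat over repeated samples: {np.mean(means):.3f} (true mu = {mu})")

# Consistency: the estimator concentrates around mu as the sample size N grows.
for n in (10, 1_000, 100_000):
    print(f"N = {n:>6}: mu_hat = {rng.normal(mu, sigma, size=n).mean():.3f}")
```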

A notable difference is that optimal Bayesian decisions are calculated by minimizing the expected value of the loss function with respect to the posterior distribution, that is, conditional on the observed data. In contrast, Frequentist “optimal” actions are based on expected values over the distribution of the estimator (a function of the data), conditional on the unknown parameters; that is, they average over sampling variability.
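A minimal sketch of the Bayesian side of this distinction, using simulated gamma draws as a stand-in for a posterior (an assumption made purely for illustration): the action minimizing posterior expected quadratic loss is the posterior mean, while under absolute loss it is the posterior median.

```python
# Minimal sketch: Bayesian point estimates minimize posterior expected loss,
# so the optimal action depends on the loss function.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.gamma(shape=3.0, scale=1.0, size=20_000)   # stand-in posterior draws

grid = np.linspace(0.0, 10.0, 1_001)                   # candidate actions
quad = [np.mean((theta - a) ** 2) for a in grid]       # E[(theta - a)^2 | data]
absl = [np.mean(np.abs(theta - a)) for a in grid]      # E[|theta - a| | data]

print("quadratic loss minimizer:", grid[np.argmin(quad)], "posterior mean:", theta.mean())
print("absolute  loss minimizer:", grid[np.argmin(absl)], "posterior median:", np.median(theta))
```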

The Bayesian approach allows for the derivation of the posterior distribution of any unknown object, such as parameters, latent variables, future or unobserved variables, or models. A major advantage is that predictions can account for estimation error, and predictive distributions (probabilistic forecasts) can be easily derived.
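For instance, the following sketch (a conjugate normal model with known variance and a flat prior, assumed here for illustration) shows how the posterior predictive distribution propagates estimation error, producing a wider spread than the plug-in alternative:

```python
# Minimal sketch (assumed conjugate normal model, known variance, flat prior):
# predictive draws integrate over posterior uncertainty about mu, so their
# spread exceeds the plug-in spread.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0                                   # known standard deviation of the data
y = rng.normal(0.3, sigma, size=20)           # observed sample (simulated)
N = len(y)

post_mu = rng.normal(y.mean(), sigma / np.sqrt(N), size=50_000)  # posterior of mu
y_pred = rng.normal(post_mu, sigma)                              # posterior predictive
y_plug = rng.normal(y.mean(), sigma, size=50_000)                # plug-in: ignores estimation error

print("predictive sd:", y_pred.std())   # ~ sigma * sqrt(1 + 1/N)
print("plug-in sd:   ", y_plug.std())   # ~ sigma
```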

Hypothesis testing (model selection) in the Bayesian framework is based on inductive logic reasoning (inverse probability). Based on observed data, we evaluate which hypothesis is most tenable, performing this evaluation using posterior odds. These odds are in turn based on Bayes factors, which assess the evidence in favor of a null hypothesis while explicitly considering the alternative (R. E. Kass and Raftery 1995), following the rules of probability (D. V. Lindley 2000). This approach compares how well hypotheses predict data (Goodman 1999), minimizes the weighted sum of type I and type II error probabilities (DeGroot 1975; Pericchi and Pereira 2015), and takes into account the implicit balance of losses (Jeffreys 1961; Bernardo and Smith 1994). Posterior odds allow for the use of the same framework to analyze nested and non-nested models and perform model averaging.
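In symbols, letting \(\boldsymbol{y}\) denote the observed data, the posterior odds factor into the Bayes factor times the prior odds:

\[ \frac{P(H_0 \mid \boldsymbol{y})}{P(H_1 \mid \boldsymbol{y})} = \underbrace{\frac{p(\boldsymbol{y} \mid H_0)}{p(\boldsymbol{y} \mid H_1)}}_{\text{Bayes factor}} \times \frac{P(H_0)}{P(H_1)}. \]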

However, Bayes factors cannot be based on improper or vague priors (Koop 2003), the practical interplay between model selection and posterior distributions is not as straightforward as it may be in the Frequentist approach, and the computational burden can be more demanding due to the need to solve potentially difficult integrals.

On the other hand, the Frequentist approach establishes most of its estimators as the solution to a system of equations; observe that optimization problems often reduce to solving such systems through their first-order conditions. We can potentially obtain the distribution of these estimators, but most of the time, asymptotic arguments or resampling techniques are required. Hypothesis testing relies on pivotal quantities and/or resampling, and prediction is typically based on a plug-in approach, which means that estimation error is not taken into account.14 In addition, ancillary statistics can be used to build prediction intervals.15
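As a simple illustration of estimators defined by systems of equations, the following sketch (an assumed Poisson example on simulated data) solves the score equation numerically; the root coincides with the sample mean, the closed-form maximum likelihood estimator.

```python
# Minimal sketch (assumed Poisson example): the MLE as the root of the score
# equation d log L / d lambda = sum(y)/lambda - N = 0.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(3)
y = rng.poisson(lam=4.0, size=200)

score = lambda lam: y.sum() / lam - len(y)   # derivative of the Poisson log-likelihood
lam_hat = brentq(score, 1e-6, 100.0)         # numerical root of the score equation

print("root of score equation:", lam_hat)
print("closed-form MLE (ybar):", y.mean())   # identical up to solver tolerance
```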

Comparing models depends on their structure. For instance, there are different Frequentist statistical approaches for comparing nested and non-nested models. A nice feature in some situations is that there is a practical interplay between hypothesis testing and confidence intervals. For example, when testing the mean of a normal population, you cannot reject the null hypothesis \(H_0: \mu = \mu^0\) at the \(\alpha\) significance level (Type I error) if \(\mu^0\) is in the \(1-\alpha\) confidence interval. Specifically,

\[ P\left( \mu \in \left[\hat{\mu} - |t_{N-1}^{\alpha/2}| \times \hat{\sigma}_{\hat{\mu}}, \hat{\mu} + |t_{N-1}^{\alpha/2}| \times \hat{\sigma}_{\hat{\mu}}\right] \right) = 1 - \alpha, \]

where \(\hat{\mu}\) is the maximum likelihood estimator of the mean, \(\hat{\sigma}_{\hat{\mu}}\) is its estimated standard error, \(t_{N-1}^{\alpha/2}\) is the quantile of the Student’s \(t\)-distribution at the \(\alpha/2\) probability level with \(N-1\) degrees of freedom, and \(N\) is the sample size.
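A minimal sketch of this duality on simulated data (the seed, sample, and \(\mu^0\) below are assumptions for illustration): the test rejects \(H_0: \mu = \mu^0\) at level \(\alpha\) exactly when \(\mu^0\) falls outside the \(1-\alpha\) interval.

```python
# Minimal sketch of the test/interval duality: H0 is rejected at level alpha
# exactly when mu0 lies outside the 1 - alpha confidence interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=30)
N, alpha, mu0 = len(y), 0.05, 0.0

mu_hat = y.mean()
se_hat = y.std(ddof=1) / np.sqrt(N)               # estimated standard error of mu_hat
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)     # |t_{N-1}^{alpha/2}|

ci = (mu_hat - t_crit * se_hat, mu_hat + t_crit * se_hat)
t_stat = (mu_hat - mu0) / se_hat

print("95% CI:", ci)
print("reject H0:", abs(t_stat) > t_crit, "| mu0 outside CI:", not (ci[0] <= mu0 <= ci[1]))
```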

A remarkable difference between the Bayesian and Frequentist inferential frameworks is the interpretation of credible/confidence intervals. Observe that once we have estimates, so that, for example, the previous interval is \([0.2, 0.4]\) at the 95% confidence level, we cannot say that \(P(\mu \in [0.2, 0.4]) = 0.95\) in the Frequentist framework. In fact, this probability is either 0 or 1 in this approach, as \(\mu\) is either in the interval or it is not; the problem is that we will never know which in applied settings. This is because

\[ P(\mu \in [\hat{\mu} - |t_{N-1}^{0.025}| \times \hat{\sigma}_{\hat{\mu}}, \hat{\mu} + |t_{N-1}^{0.025}| \times \hat{\sigma}_{\hat{\mu}}]) = 0.95 \]

is interpreted in the context of repeated sampling. On the other hand, once we have the posterior distribution in the Bayesian framework, we can say that \(P(\mu \in [0.2, 0.4]) = 0.95\).
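The repeated-sampling interpretation can be checked directly by simulation; in the following sketch (assumed normal data with a fixed true \(\mu\)), roughly 95% of the random intervals cover the unknown parameter:

```python
# Minimal sketch of the repeated-sampling interpretation: the interval is the
# random object, mu is fixed, and coverage is a long-run property.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, N, reps = 1.0, 2.0, 30, 10_000
t_crit = stats.t.ppf(0.975, df=N - 1)

covered = 0
for _ in range(reps):
    y = rng.normal(mu, sigma, size=N)
    half = t_crit * y.std(ddof=1) / np.sqrt(N)
    covered += (y.mean() - half) <= mu <= (y.mean() + half)

print("empirical coverage:", covered / reps)   # close to 0.95
```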

Following common practice, most researchers and practitioners conduct hypothesis testing based on the p-value in the Frequentist framework. But what is a p-value? Most users do not know the answer, as statistical inference is often not performed by statisticians (J. Berger 2006).16 A p-value is the probability of obtaining a statistical summary of the data equal to or more extreme than what was actually observed, assuming that the null hypothesis is true.
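As a concrete illustration (an assumed one-sample \(t\)-test of \(H_0: \mu = 0\) on simulated data), the p-value is the tail area of the null distribution beyond the observed statistic:

```python
# Minimal sketch: the p-value is the probability, under H0, of a statistic at
# least as extreme as the observed one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(0.4, 1.0, size=25)
N = len(y)

t_obs = y.mean() / (y.std(ddof=1) / np.sqrt(N))   # observed test statistic
p_value = 2 * stats.t.sf(abs(t_obs), df=N - 1)    # two-sided tail area under H0

print("t =", t_obs, "p-value =", p_value)
print(stats.ttest_1samp(y, popmean=0.0))          # cross-check against scipy
```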

Therefore, p-value calculations involve not just the observed data, but also more extreme hypothetical observations. Thus,

“What the use of p implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” (Jeffreys 1961)

It seems that common Frequentist inferential practice intertwines two different logical reasoning arguments: the p-value (Fisher 1958) and the significance level (Neyman and Pearson 1933). The former is an informal short-run criterion, whose philosophical foundation is reductio ad absurdum, which measures the discrepancy between the data and the null hypothesis; therefore, the p-value is not a direct measure of the probability that the null hypothesis is false. The latter, whose philosophical foundation is deduction, is based on long-run performance and controls the overall number of incorrect inferences in repeated sampling, without regard to individual cases. The p-value fallacy consists of simultaneously interpreting the p-value as the strength of evidence against the null hypothesis and as the long-run frequency of Type I error under the null hypothesis (Goodman 1999).

The American Statistical Association has several concerns regarding the use of the p-value as a cornerstone for hypothesis testing in science. This concern motivates the ASA’s statement on p-values (Wasserstein and Lazar 2016), which can be summarized in the following principles:

  • P-values can indicate how incompatible the data are with a specified statistical model.
  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • Scientific conclusions and business or policy decisions should not be based solely on whether a p-value passes a specific threshold.
  • Proper inference requires full reporting and transparency.
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

To sum up, Fisher proposed the p-value as a witness rather than a judge. A p-value lower than the significance level therefore calls for closer inspection of the null hypothesis; it is not a final verdict on it.

Another difference between the Frequentists and the Bayesians is the way in which scientific hypotheses are tested. The former use the p-value, whereas the latter use the Bayes factor. Observe that the p-value is associated with the probability of the data given the hypothesis, whereas the Bayes factor is associated with the probability of the hypothesis given the data. However, there is an approximate link between the \(t\) statistic and the Bayes factor for regression coefficients (A. Raftery 1995). In particular,

\[ |t|>(\log(N)+6)^{1/2} \]

corresponds to strong evidence against the null hypothesis that a control in a regression is irrelevant. Observe that, in this setting, the threshold of the \(t\) statistic, and consequently the significance level, depends on the sample size. This agrees with the idea in experimental design of selecting the sample size so as to control both Type I and Type II errors. In observational studies, we cannot control the sample size, but we can select the significance level.
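A minimal sketch of this sample-size dependence, tabulating the threshold \((\log(N)+6)^{1/2}\) and the significance level it implies (a normal approximation to the \(t\) distribution is assumed here for simplicity):

```python
# Minimal sketch of Raftery's (1995) rule of thumb: the "strong evidence"
# threshold sqrt(log(N) + 6) grows with N, so the implied significance level
# shrinks with the sample size.
import numpy as np
from scipy import stats

for N in (50, 500, 5_000, 50_000):
    threshold = np.sqrt(np.log(N) + 6)
    implied_alpha = 2 * stats.norm.sf(threshold)   # two-sided tail probability
    print(f"N = {N:>6}: |t| > {threshold:.2f}, implied alpha ~ {implied_alpha:.4f}")
```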

See also Sellke, Bayarri, and Berger (2001) and Benjamin et al. (2018) for exercises that reveal potential flaws of the p-value (\(p\)), which stem from the fact that \(p \sim U[0,1]\) under the null hypothesis,17 and for calibrations of the p-value that allow it to be interpreted as an odds ratio and as an error probability. In particular, they define

\[ B(p)=-e \times p \times \log(p) \quad \text{when} \quad p < e^{-1} \]

and interpret it as the Bayes factor of \(H_0\) to \(H_1\), where \(H_1\) denotes the unspecified alternative to \(H_0\), and

\[ \alpha(p) = \left(1 + \left[-e \times p \times \log(p)\right]^{-1}\right)^{-1} \]

as the error probability \(\alpha\) in rejecting \(H_0\). Take into account that \(B(p)\) and \(\alpha(p)\) are lower bounds.
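A minimal sketch evaluating these lower bounds at conventional significance levels; for example, \(p = 0.05\) corresponds to a Bayes factor bound of about 0.41 and an error probability of at least about 0.29.

```python
# Minimal sketch of the Sellke-Bayarri-Berger calibrations: lower bounds on the
# Bayes factor of H0 to H1 and on the error probability, valid for p < exp(-1).
import numpy as np

def bayes_factor_bound(p):
    """B(p) = -e * p * log(p): lower bound on the Bayes factor of H0 to H1."""
    return -np.e * p * np.log(p)

def error_prob_bound(p):
    """alpha(p) = (1 + B(p)^{-1})^{-1}: lower bound on the error probability."""
    return 1.0 / (1.0 + 1.0 / bayes_factor_bound(p))

for p in (0.05, 0.01):
    print(f"p = {p}: B(p) >= {bayes_factor_bound(p):.3f}, "
          f"alpha(p) >= {error_prob_bound(p):.3f}")
```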

The logic of argumentation in the Frequentist approach is based on deductive logic: it starts from a statement about the true state of nature (the null hypothesis) and predicts what should be observed if this statement were true. The Bayesian approach, on the other hand, is based on inductive logic: it assesses which hypothesis is most consistent with what is observed. The former inferential approach establishes that the truth of the premises implies the truth of the conclusion, which is why we reject or fail to reject hypotheses. The latter establishes that the premises supply some evidence, but not full assurance, of the truth of the conclusion, which is why we obtain probabilistic statements.

Here, there is a distinction between the effects of causes (forward causal inference) and the causes of effects (reverse causal inference) (Andrew Gelman and Imbens 2013; Dawid, Musio, and Fienberg 2016). To illustrate this point, imagine that a firm increases the price of a specific good. Economic theory would suggest that, as a result, demand for the good decreases. In this case, the premise (null hypothesis) is the price increase, and the consequence is the decrease in the firm’s demand.

Alternatively, one could observe a reduction in a firm’s demand and attempt to identify the cause behind it. For example, a reduction in quantity could be due to a negative supply shock. The Frequentist approach typically follows the first view (effects of causes), while Bayesian reasoning focuses on determining the probability of potential causes (causes of effects).

References

Benjamin, Daniel J, James O Berger, Magnus Johannesson, Brian A Nosek, E-J Wagenmakers, Richard Berk, Kenneth A Bollen, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10.
Berger, James O. 2006. “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1 (3): 385–402.
Bernardo, J., and A. Smith. 1994. Bayesian Theory. Chichester: Wiley.
Dawid, A. P., M. Musio, and S. E. Fienberg. 2016. “From Statistical Evidence to Evidence of Causality.” Bayesian Analysis 11 (3): 725–52.
DeGroot, M. H. 1975. Probability and Statistics. London: Addison-Wesley Publishing Co.
Fisher, R. 1958. Statistical Methods for Research Workers. 13th ed. New York: Hafner.
Gelman, Andrew, and Guido Imbens. 2013. “Why Ask Why? Forward Causal Inference and Reverse Causal Questions.” National Bureau of Economic Research.
Goodman, S. N. 1999. “Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy.” Annals of Internal Medicine 130 (12): 995–1004.
Jeffreys, H. 1961. Theory of Probability. London: Oxford University Press.
Kass, R. E., and A. E. Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430): 773–95.
Koop, Gary M. 2003. Bayesian Econometrics. John Wiley & Sons Inc.
Lindley, D. V. 2000. “The Philosophy of Statistics.” The Statistician 49 (3): 293–337.
Neyman, J., and E. Pearson. 1933. “On the Problem of the Most Efficient Tests of Statistical Hypotheses.” Philosophical Transactions of the Royal Society, Series A 231: 289–337.
Pericchi, Luis, and Carlos Pereira. 2015. “Adaptative Significance Levels Using Optimal Decision Rules: Balancing by Weighting the Error Probabilities.” Brazilian Journal of Probability and Statistics.
Raftery, A. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25: 111–63.
Sellke, Thomas, MJ Bayarri, and James O Berger. 2001. “Calibration of p Values for Testing Precise Null Hypotheses.” The American Statistician 55 (1): 62–71.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician.

  14. A pivotal quantity is a function of unobserved parameters and observations whose probability distribution does not depend on the unknown parameters.

  15. An ancillary statistic is a pivotal quantity that is also a statistic.

  16. See also: https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/

  17. See https://joyeuserrance.wordpress.com/2011/04/22/proof-that-p-values-under-the-null-are-uniformly-distributed/ for a simple proof.