6.10 Bayesian bootstrap regression
We implement the Bayesian bootstrap (Donald B. Rubin 1981) for linear regression models. In particular, the Bayesian bootstrap simulates the posterior distribution by assuming that the sample cumulative distribution function (CDF) is the population CDF (an assumption that is also implicit in the frequentist bootstrap (B. Efron 1979)).
Given \(y_i \stackrel{i.i.d.}{\sim} \mathcal{F}\), where \(\mathcal{F}\) is not restricted to a particular parametric family of distributions but satisfies \(\mathbb{E}(y_i \mid \boldsymbol{x}_i) = \boldsymbol{x}_i^{\top} \boldsymbol{\beta}\), with \(\boldsymbol{x}_i\) a \(K\)-dimensional vector of regressors and \(\boldsymbol{\beta}\) a \(K\)-dimensional vector of parameters, the Bayesian bootstrap generates posterior probabilities for each observed \(y_i\), while values of \(\boldsymbol{y}\) that are not observed receive zero posterior probability.
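To fix ideas, the following minimal sketch (a toy illustration on a univariate sample, not part of the algorithm below) draws one vector of \(\text{Dir}(1, \dots, 1)\) weights, obtained as normalized i.i.d. exponential draws, and computes a single posterior draw of the mean; note that only observed values receive positive weight.

set.seed(10101)
yobs <- rnorm(50, mean = 2) # A toy observed sample
g <- rexp(50); g <- g/sum(g) # One Dir(1,...,1) draw: normalized Exp(1) variates
sum(g*yobs) # A single posterior draw of E(y) under the Bayesian bootstrap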
The algorithm to implement the Bayesian bootstrap is the following:
1. Draw \(\boldsymbol{g} \sim \text{Dir}(\alpha_1, \alpha_2, \dots, \alpha_N)\) such that \(\alpha_i = 1\) for all \(i\). The vector \(\boldsymbol{g} = (g_1, g_2, \dots, g_N)\) contains the probabilities to attach to \((y_1, \boldsymbol{x}_1), (y_2, \boldsymbol{x}_2), \dots, (y_N, \boldsymbol{x}_N)\) in each Bayesian bootstrap replication.
2. Sample \((y_i, \boldsymbol{x}_i)\) \(N\) times with replacement, using probabilities \(g_i\), \(i = 1, 2, \dots, N\).
3. Estimate \(\boldsymbol{\beta}^{(s)}\) using ordinary least squares in the model \(\mathbb{E}(\boldsymbol{y} \mid \boldsymbol{X}) = \boldsymbol{X} \boldsymbol{\beta}\), where \(\boldsymbol{y}\) is the \(N\)-dimensional vector and \(\boldsymbol{X}\) the \(N \times K\) matrix obtained in the previous step.
4. Repeat this process \(S\) times. The distribution of \(\boldsymbol{\beta}^{(s)}\), \(s = 1, 2, \dots, S\), is the Bayesian bootstrap distribution of \(\boldsymbol{\beta}\).
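As an aside, step 2 can be skipped: conditional on \(\boldsymbol{g}\), an exact posterior draw of \(\boldsymbol{\beta}^{(s)}\) is obtained by weighted least squares with weights \(g_i\), which avoids the approximation introduced by the resampling step. A minimal sketch of one such draw, assuming the data frame data created in the simulation code below:

g <- rexp(nrow(data)); g <- g/sum(g) # Dir(1,...,1) weights
coef(lm(y ~ x1 + x2, data = data, weights = g)) # One posterior draw of beta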
Example: Simulation exercise
Let’s perform a simulation exercise to evaluate the performance of the previous Algorithm for inference using the Bayesian bootstrap. The data-generating process has two regressors: the first is distributed uniformly on \((0, 1)\) and the second is standard normal. The location vector, including the intercept, is \(\boldsymbol{\beta} = \left[1 \ 1 \ 1\right]^{\top}\), the variance is \(\sigma^2 = 1\), and the sample size is 1,000.
The following Algorithm illustrates how to use our GUI to run the Bayesian bootstrap. Our GUI is based on the bayesboot command from the bayesboot package in R; a minimal sketch of a direct call to this package is shown after the Algorithm. Exercise 11 asks you to use this package to perform inference in this simulation and to compare the results with those obtained using our GUI with \(S = 10000\).
Algorithm: Bayesian Bootstrap in Linear Regression
- Select Univariate Models on the top panel
- Select Bootstrap model using the left radio button
- Upload the dataset by first selecting whether the file has a header, and the kind of separator in the CSV file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
- Select the number of bootstrap replications using the Range sliders
- Select the dependent and independent variables using the Formula builder table
- Click the Build formula button to generate the formula in R syntax. You can modify the formula in the Main equation box using valid arguments of the formula command structure in R
- Click the Go! button
- Analyze results
- Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons
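For reference, here is a minimal sketch of the direct bayesboot call mentioned above, using the package's weighted-statistic interface; the names RegW and BBpack are illustrative, and data is the simulated data frame created in the code below (Exercise 11 asks for the full comparison).

library(bayesboot)
RegW <- function(d, w){ # The statistic receives the data and the Dirichlet weights
coef(lm(y ~ x1 + x2, data = d, weights = w))
}
BBpack <- bayesboot(data, RegW, R = 10000, use.weights = TRUE)
summary(BBpack)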
The following R code shows how to program the Bayesian bootstrap from scratch. We observe from the results that all 95% credible intervals encompass the population parameters, and the posterior means are close to them.
rm(list = ls()); set.seed(010101)
# Simulate the data-generating process
N <- 1000; x1 <- runif(N); x2 <- rnorm(N)
X <- cbind(x1, x2); k <- dim(X)[2]
B <- rep(1, k + 1); sig2 <- 1 # Location vector (including intercept) and error sd
u <- rnorm(N, 0, sig2); y <- cbind(1, X) %*% B + u
data <- as.data.frame(cbind(y, X))
names(data) <- c("y", "x1", "x2")
Reg <- function(d){ # OLS estimates on a resampled data set
  Reg <- lm(y ~ x1 + x2, data = d)
  Bhat <- Reg$coef
  return(Bhat)
}
S <- 10000; alpha <- 1
BB <- function(S, df, alpha){
  Betas <- matrix(NA, S, dim(df)[2]) # ncol(df) equals the number of coefficients here
  N <- dim(df)[1]
  pb <- txtProgressBar(min = 0, max = S, style = 3) # Portable progress bar
  for(s in 1:S){
    # One N-dimensional Dirichlet(alpha, ..., alpha) draw of resampling probabilities
    g <- c(LaplacesDemon::rdirichlet(1, rep(alpha, N)))
    ids <- sample(1:N, size = N, replace = TRUE, prob = g)
    datas <- df[ids, ]
    names(datas) <- names(df)
    Bs <- Reg(d = datas)
    Betas[s, ] <- Bs
    setTxtProgressBar(pb, s)
  }
  close(pb)
  return(Betas)
}
BBs <- BB(S = S, df = data, alpha = alpha)
summary(coda::mcmc(BBs))
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## [1,] 0.9182 0.06475 0.0006475 0.0006475
## [2,] 1.1735 0.10948 0.0010948 0.0010948
## [3,] 1.0133 0.03359 0.0003359 0.0003359
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## var1 0.7882 0.8755 0.9178 0.9612 1.045
## var2 0.9560 1.1009 1.1738 1.2467 1.391
## var3 0.9467 0.9902 1.0134 1.0362 1.079
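The posterior draws can also be inspected graphically, analogous to the diagnostic plots available from the GUI; for instance, trace and density plots using coda:

plot(coda::mcmc(BBs)) # Trace and density plots of the posterior draws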