6.3 The probit model
The probit model also has a binary dependent variable. In this case, an unobserved latent variable \(y_i^*\) defines the structure of the estimation problem.
In particular, \(y_i^* = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \mu_i\), where \(\mu_i \stackrel{i.i.d.}{\sim} N(0,1)\) and \(\mathbf{x}_i\) is a \(K\)-dimensional vector of regressors, such that
\[ y_i = \begin{cases} 0, & \text{if } y_i^* \leq 0, \\ 1, & \text{if } y_i^* > 0. \end{cases} \]
The variance of \(\mu_i\) is fixed at 1 due to identification restrictions (see the note at the end of this section). This implies \(P(y_i = 1) = \pi_i = \Phi(\mathbf{x}_i^{\top} \boldsymbol{\beta})\), where \(\Phi\) is the standard normal cumulative distribution function.
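To see the latent-variable mechanics concretely, the following R sketch (all names and values are illustrative, not from the text) simulates data from this representation and checks that the empirical frequency of \(y_i = 1\) matches \(\Phi(\mathbf{x}_i^{\top}\boldsymbol{\beta})\):

```r
# Minimal sketch: simulate data from the probit latent-variable representation.
set.seed(10)
N    <- 100000
beta <- c(-1, 0.5)             # true coefficients (illustrative)
X    <- cbind(1, rnorm(N))     # intercept plus one regressor
ystar <- X %*% beta + rnorm(N) # latent variable, mu_i ~ N(0, 1)
y     <- as.numeric(ystar > 0) # observed binary outcome
# For observations with x_2 near 0, the empirical P(y = 1) should be
# close to pnorm(-1), the model-implied probability at the intercept
mean(y[abs(X[, 2]) < 0.1])
pnorm(-1)
```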
Albert and Chib (1993) implemented data augmentation (Tanner and Wong 1987) to apply a Gibbs sampling algorithm to this model. Augmenting this model with \(y_i^*\), we can express the likelihood contribution from observation \(i\) as:
\[ p(y_i \mid y_i^*) = \mathbb{1}({y_i = 0}) \mathbb{1}({y_i^* \leq 0}) + \mathbb{1}({y_i = 1}) \mathbb{1}({y_i^* > 0}), \]
where \(\mathbb{1}(A)\) is an indicator function that takes the value of 1 when the condition \(A\) is satisfied.
The posterior distribution is:
\[ \pi(\boldsymbol{\beta}, \mathbf{y}^* \mid \mathbf{y}, \mathbf{X}) \propto \prod_{i=1}^N \left[\mathbb{1}({y_i = 0}) \mathbb{1}({y_i^* \leq 0}) + \mathbb{1}({y_i = 1}) \mathbb{1}({y_i^* > 0}) \right] \]
\[ \times N_N(\mathbf{y}^* \mid \mathbf{X}\boldsymbol{\beta}, \mathbf{I}_N) \times N_K(\boldsymbol{\beta} \mid \boldsymbol{\beta}_0, \mathbf{B}_0), \]
where we assume a Gaussian prior for \(\boldsymbol{\beta}\): \(\boldsymbol{\beta} \sim N_K(\boldsymbol{\beta}_0, \mathbf{B}_0)\).
This implies that the conditional posterior distributions are
\[ y_i^* \mid \boldsymbol{\beta}, \mathbf{y}, \mathbf{X} \sim \begin{cases} TN_{(-\infty,0]}(\mathbf{x}_i^{\top} \boldsymbol{\beta}, 1), & \text{if } y_i = 0, \\ TN_{(0,\infty)}(\mathbf{x}_i^{\top} \boldsymbol{\beta}, 1), & \text{if } y_i = 1, \end{cases} \]
where \(TN_{(a,b)}\) denotes a normal density truncated to the interval \((a,b)\), and
\[ \boldsymbol{\beta} \mid \mathbf{y}^*, \mathbf{X} \sim N(\boldsymbol{\beta}_n, \mathbf{B}_n), \]
where \(\mathbf{B}_n = (\mathbf{B}_0^{-1} + \mathbf{X}^{\top} \mathbf{X})^{-1}\), and \(\boldsymbol{\beta}_n = \mathbf{B}_n (\mathbf{B}_0^{-1} \boldsymbol{\beta}_0 + \mathbf{X}^{\top} \mathbf{y}^*)\).
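These two conditional posteriors are all that a Gibbs sampler needs. The following self-contained R sketch (simulated data; all names and tuning values are illustrative, not from the text) alternates between them, drawing the latent \(y_i^*\) from truncated normals via the inverse-CDF method:

```r
# Sketch of the Albert-Chib Gibbs sampler for the probit model.
set.seed(10)
N <- 1000; K <- 2
X <- cbind(1, rnorm(N))                 # simulated design matrix
beta_true <- c(-0.5, 1)                 # true coefficients (illustrative)
y <- as.numeric(X %*% beta_true + rnorm(N) > 0)
b0 <- rep(0, K); B0 <- diag(K)          # prior: beta ~ N(b0, B0)
B0i <- solve(B0)
Bn  <- solve(B0i + crossprod(X))        # posterior covariance (fixed, since var = 1)
rtnorm <- function(mu, lower, upper) {  # inverse-CDF truncated normal, sd = 1
  u <- runif(length(mu), pnorm(lower - mu), pnorm(upper - mu))
  mu + qnorm(u)
}
S <- 2000
beta_draws <- matrix(0, S, K)
beta <- rep(0, K)
for (s in 1:S) {
  xb <- as.vector(X %*% beta)
  # Step 1: draw latent y* from the truncated normals above
  ystar <- ifelse(y == 1, rtnorm(xb, 0, Inf), rtnorm(xb, -Inf, 0))
  # Step 2: draw beta from N(bn, Bn)
  bn <- Bn %*% (B0i %*% b0 + crossprod(X, ystar))
  beta <- as.vector(bn + t(chol(Bn)) %*% rnorm(K))
  beta_draws[s, ] <- beta
}
colMeans(beta_draws[-(1:500), ])        # should be near beta_true
```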
Example: Determinants of hospitalization
We use the dataset named 2HealthMed.csv, which is located in the DataApp folder of our GitHub repository (https://github.com/besmarter/BSTApp), and was used by Ramírez Hassan, Cardona Jiménez, and Cadavid Montoya (2013). The dependent variable is a binary indicator, taking the value 1 if an individual was hospitalized in 2007, and 0 otherwise.
The specification of the model is
\[ \text{Hosp}_i = \boldsymbol{\beta}_1 + \boldsymbol{\beta}_2 \text{SHI}_i + \boldsymbol{\beta}_3 \text{Female}_i + \boldsymbol{\beta}_4 \text{Age}_i + \boldsymbol{\beta}_5 \text{Age}_i^2 + \boldsymbol{\beta}_6 \text{Est2}_i + \boldsymbol{\beta}_7 \text{Est3}_i + \boldsymbol{\beta}_8 \text{Fair}_i + \boldsymbol{\beta}_9 \text{Good}_i + \boldsymbol{\beta}_{10} \text{Excellent}_i, \]
where SHI is a binary variable equal to 1 if the individual is enrolled in a subsidized health care program and 0 otherwise, Female is an indicator of gender, Age is measured in years, Est2 and Est3 are indicators of socioeconomic status, with Est1 (the lowest status) being the reference category, and Fair, Good, and Excellent are indicators of self-perceived health status, with Bad being the reference category.
We set \(\boldsymbol{\beta}_0 = {\boldsymbol{0}}_{10}\) and \({\boldsymbol{B}}_0 = {\boldsymbol{I}}_{10}\), with iterations, burn-in, and thinning parameters equal to 10,000, 1,000, and 1, respectively. The Algorithm below shows how to run the probit model in our GUI, which relies on the rbprobitGibbs command from the bayesm package to perform inference. The R code that follows shows how to run this example using rbprobitGibbs directly. You are also asked to implement a Gibbs sampler for the probit model in the exercises.
Algorithm: Probit model in the GUI
- Select Univariate Models on the top panel
- Select Probit model using the left radio button
- Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
- Select MCMC iterations, burn-in, and thinning parameters using the Range sliders
- Select dependent and independent variables using the Formula builder table
- Click the Build formula button to generate the formula in R syntax. You can modify the formula in the Main equation box using valid arguments of the formula command structure in R
- Set the hyperparameters: mean vector and covariance matrix. This step is not necessary as by default our GUI uses non-informative priors
- Select the tuning parameter for the Metropolis-Hastings algorithm. This step is not necessary as by default our GUI sets the tuning parameter at 1.1
- Click the Go! button
- Analyze results
- Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons
Our analysis finds evidence that gender and self-perceived health status significantly affect the probability of hospitalization. Women have a higher probability of being hospitalized than men, and individuals with a better perception of their health status have a lower probability of hospitalization.
mydata <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/2HealthMed.csv", sep = ",", header = TRUE, quote = "")
attach(mydata)
str(mydata) # Structure of the dataset
## 'data.frame': 12975 obs. of 22 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MedVisPrev : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MedVisPrevOr: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hosp : int 0 0 0 0 0 0 0 0 0 0 ...
## $ SHI : int 1 1 1 1 0 0 1 1 0 0 ...
## $ Female : int 0 1 1 1 0 1 0 1 0 1 ...
## $ Age : int 7 39 23 15 8 54 64 40 6 7 ...
## $ Age2 : int 49 1521 529 225 64 2916 4096 1600 36 49 ...
## $ FemaleAge : int 0 39 23 15 0 54 0 40 0 7 ...
## $ Est1 : int 1 0 0 0 0 0 0 0 0 0 ...
## $ Est2 : int 0 1 1 1 0 1 1 1 0 0 ...
## $ Est3 : int 0 0 0 0 1 0 0 0 1 1 ...
## $ Bad : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Fair : int 0 0 0 0 0 0 1 0 0 0 ...
## $ Good : int 1 1 1 1 0 0 0 1 1 1 ...
## $ Excellent : int 0 0 0 0 1 1 0 0 0 0 ...
## $ NoEd : int 1 0 0 0 1 0 0 0 1 1 ...
## $ PriEd : int 0 0 0 0 0 1 1 1 0 0 ...
## $ HighEd : int 0 1 1 1 0 0 0 0 0 0 ...
## $ VocEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ UnivEd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PTL : num 0.43 0 0 0 0 0.06 0 0.38 0 1 ...
K <- 10 # Number of regressors
b0 <- rep(0, K) # Prior mean
B0i <- diag(K) # Prior precision (inverse of covariance)
Prior <- list(betabar = b0, A = B0i) # Prior list
y <- Hosp # Dependent variable
X <- cbind(1, SHI, Female, Age, Age2, Est2, Est3, Fair, Good, Excellent) # Regressors
Data <- list(y = y, X = X) # Data list
Mcmc <- list(R = 10000, keep = 1, nprint = 0) # MCMC parameters
RegProb <- bayesm::rbprobitGibbs(Data = Data, Prior = Prior, Mcmc = Mcmc) # Inference using bayesm package
##
## Starting Gibbs Sampler for Binary Probit Model
## with 12975 observations
## Table of y Values
## y
## 0 1
## 12571 404
##
## Prior Parms:
## betabar
## [1] 0 0 0 0 0 0 0 0 0 0
## A
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 0 0 0 0 0 0 0 0 0
## [2,] 0 1 0 0 0 0 0 0 0 0
## [3,] 0 0 1 0 0 0 0 0 0 0
## [4,] 0 0 0 1 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 0 0 0
## [6,] 0 0 0 0 0 1 0 0 0 0
## [7,] 0 0 0 0 0 0 1 0 0 0
## [8,] 0 0 0 0 0 0 0 1 0 0
## [9,] 0 0 0 0 0 0 0 0 1 0
## [10,] 0 0 0 0 0 0 0 0 0 1
##
## MCMC parms:
## R= 10000 keep= 1 nprint= 0
##
PostPar <- coda::mcmc(RegProb$betadraw) # Posterior draws
colnames(PostPar) <- c("Cte", "SHI", "Female", "Age", "Age2", "Est2", "Est3", "Fair", "Good", "Excellent") # Names
summary(PostPar) # Posterior summary
##
## Iterations = 1:10000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 10000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Cte -9.443e-01 1.347e-01 1.347e-03 3.413e-03
## SHI -7.227e-03 5.934e-02 5.934e-04 2.229e-03
## Female 1.260e-01 4.807e-02 4.807e-04 1.780e-03
## Age 1.112e-04 3.551e-03 3.551e-05 1.164e-04
## Age2 4.078e-05 4.222e-05 4.222e-07 1.249e-06
## Est2 -8.658e-02 5.370e-02 5.370e-04 1.898e-03
## Est3 -4.254e-02 8.112e-02 8.112e-04 2.774e-03
## Fair -4.961e-01 1.119e-01 1.119e-03 2.013e-03
## Good -1.204e+00 1.114e-01 1.114e-03 2.205e-03
## Excellent -1.061e+00 1.316e-01 1.316e-03 3.136e-03
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Cte -1.210e+00 -1.034e+00 -9.450e-01 -8.523e-01 -0.6839311
## SHI -1.207e-01 -4.742e-02 -8.527e-03 3.225e-02 0.1125873
## Female 3.273e-02 9.359e-02 1.261e-01 1.585e-01 0.2219169
## Age -6.849e-03 -2.343e-03 6.224e-05 2.535e-03 0.0071123
## Age2 -4.198e-05 1.215e-05 4.177e-05 6.976e-05 0.0001213
## Est2 -1.911e-01 -1.235e-01 -8.634e-02 -4.971e-02 0.0168394
## Est3 -2.018e-01 -9.709e-02 -4.153e-02 1.256e-02 0.1129273
## Fair -7.141e-01 -5.704e-01 -4.971e-01 -4.218e-01 -0.2752697
## Good -1.419e+00 -1.279e+00 -1.205e+00 -1.131e+00 -0.9832597
## Excellent -1.323e+00 -1.147e+00 -1.062e+00 -9.730e-01 -0.8028882
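Posterior coefficients of a probit model are not directly interpretable as changes in probability. A common follow-up is to compute the average effect of a binary regressor on \(P(y_i = 1)\) from the posterior draws. The sketch below uses simulated stand-ins for the posterior draws (RegProb$betadraw) and the design matrix; the numbers loosely mimic the intercept and Female coefficient above but are illustrative only:

```r
# Sketch: posterior distribution of the average effect of a dummy regressor
# on P(y = 1) in a probit model. Draws and data are simulated stand-ins.
set.seed(10)
N <- 500; S <- 200
X <- cbind(1, rbinom(N, 1, 0.5))          # intercept and a dummy regressor
beta_draws <- cbind(rnorm(S, -0.9, 0.13), # mimics posterior draws of the
                    rnorm(S, 0.13, 0.05)) # intercept and the dummy's coefficient
ame <- sapply(1:S, function(s) {
  b <- beta_draws[s, ]
  X1 <- X; X1[, 2] <- 1                   # set the dummy to 1 for everyone
  X0 <- X; X0[, 2] <- 0                   # set the dummy to 0 for everyone
  mean(pnorm(X1 %*% b) - pnorm(X0 %*% b)) # average change in P(y = 1)
})
c(mean = mean(ame), quantile(ame, c(0.025, 0.975)))
```

Each posterior draw of \(\boldsymbol{\beta}\) yields one value of the average effect, so the output summarizes its full posterior distribution rather than a point estimate.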
Note: The variance in this model is set to 1 due to identification restrictions. Observe that \(P(y_i = 1 \mid \mathbf{x}_i) = P(y_i^* > 0 \mid \mathbf{x}_i) = P(\mathbf{x}_i^{\top} \boldsymbol{\beta} + \mu_i > 0 \mid \mathbf{x}_i) = P(\mu_i > -\mathbf{x}_i^{\top} \boldsymbol{\beta} \mid \mathbf{x}_i) = P(c \times \mu_i > -c \times \mathbf{x}_i^{\top} \boldsymbol{\beta} \mid \mathbf{x}_i)\) for all \(c > 0\). Multiplying by a positive constant does not affect the probability of \(y_i = 1\).
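This identification argument can be verified numerically: rescaling both the index \(\mathbf{x}_i^{\top}\boldsymbol{\beta}\) and the error standard deviation by the same \(c > 0\) leaves the implied probability unchanged. A minimal check (the values 0.7 and 3 are arbitrary):

```r
# P(mu > -x'b) with sd = 1 equals P(c*mu > -c*x'b), i.e. sd scaled by c
xb <- 0.7; c_ <- 3
p1 <- pnorm(xb)                 # probability under the identified model
p2 <- pnorm(c_ * xb, sd = c_)   # probability after scaling by c
abs(p1 - p2) < 1e-12
```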