6.8 Tobit model

The dependent variable is partially observed in Tobit models due to sampling schemes, whereas the regressors are completely observed. In particular,

\[\begin{equation} y_i = \begin{Bmatrix} L, & \quad y_i^* < L, \\ y_i^*, & \quad L \leq y_i^* < U, \\ U, & \quad y_i^* \geq U, \end{Bmatrix} \end{equation}\]

where \(y_i^* \stackrel{i.i.d.}{\sim} N(\mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2)\), \(\mathbf{x}_i\) is a \(K\)-dimensional vector of regressors.³⁰

We use conjugate independent priors \(\boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \mathbf{B}_0)\) and \(\sigma^2 \sim IG(\alpha_0/2, \delta_0/2)\), and data augmentation using \(\mathbf{y}^*_C\) such that \(y_{C_i}^* \stackrel{i.n.d.}{\thicksim} N(\mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2)\), \(y_{C_i} = \left\{ y_{C_i^L}^* \cup y_{C_i^U}^* \right\}\) are lower and upper censored data. This allows implementing the Gibbs sampling algorithm (S. Chib 1992).

Then,

\[\begin{align} \pi(\boldsymbol{\beta}, \sigma^2, \mathbf{y^*} \mid \mathbf{y}, \mathbf{X}) &\propto \prod_{i=1}^N \left[ \mathbb{1}({y_i = L}) \mathbb{1}({y_{C_i^L}^* < L}) + \mathbb{1}({L \leq y_i < U}) + \mathbb{1}({y_i = U}) \mathbb{1}({y_{C_i^U}^* \geq U}) \right] \\ &\times N(y_i^* \mid \mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2) \times N(\boldsymbol{\beta} \mid \boldsymbol{\beta}_0, \mathbf{B}_0) \times IG(\sigma^2 \mid \alpha_0/2, \delta_0/2) \end{align}\]

The posterior distributions are:

\[\begin{equation} y_{C_i}^* \mid \boldsymbol{\beta}, \sigma^2, \mathbf{y}, \mathbf{X} \sim \begin{Bmatrix} TN_{(-\infty,L)}(\mathbf{x}_i^{\top}\boldsymbol{\beta}, \sigma^2) \ , \ y_i = L \\ TN_{[U, \infty)}(\mathbf{x}_i^{\top}\boldsymbol{\beta}, \sigma^2) \ \ , \ y_i = U \end{Bmatrix} \end{equation}\]

\[\begin{equation} \boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim N(\boldsymbol{\beta}_n, \sigma^2 \mathbf{B}_n) \end{equation}\]

\[\begin{equation} \sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}, \mathbf{X} \sim IG(\alpha_n/2, \delta_n/2) \end{equation}\]

where \(\mathbf{B}_n = (\mathbf{B}_0^{-1} + \sigma^{-2} \mathbf{X}^{\top} \mathbf{X})^{-1}\), \(\boldsymbol{\beta}_n = \mathbf{B}_n(\mathbf{B}_0^{-1} \boldsymbol{\beta}_0 + \sigma^{-2} \mathbf{X}^{\top} \mathbf{y}^*)\), \(\alpha_n = \alpha_0 + N\), and \(\delta_n = \delta_0 + (\mathbf{y}^* - \mathbf{X} \boldsymbol{\beta})^{\top}(\mathbf{y}^* - \mathbf{X} \boldsymbol{\beta})\).

Example: The market value of soccer players in Europe continues

We continue with the example of the market value of soccer players from Section 6.1. We specify the same equation but assume that the sample is censored from below, such that we only have information for soccer players whose market value is higher than one million euros. The dependent variable is log(ValueCens), and the left censoring point is 13.82.

The following Algorithm illustrates how to estimate Tobit models in our GUI. Our GUI utilizes the MCMCtobit command from the MCMCpack package.

Algorithm: Tobit models

Select Univariate Models on the top panel
Select Tobit model using the left radio button
Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
Select MCMC iterations, burn-in, and thinning parameters using the Range sliders
Select dependent and independent variables using the Formula builder table
Click the Build formula button to generate the formula in R syntax. You can modify the formula in the Main equation box using valid arguments of the formula command structure in R
Set the left and right censoring points. To censor above only, specify -Inf in the left censoring box, and to censor below only, specify Inf in the right censoring box
Set the hyperparameters: mean vector, covariance matrix, shape, and scale parameters. This step is not necessary as by default our GUI uses non-informative priors
Click the Go! button
Analyze results
Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons

We run this application using the same hyperparameters that we set in the example of Section 6.1. All results seem similar to those in the example of linear models. In addition, the posterior chains seem to achieve good diagnostics.

rm(list = ls()); set.seed(010101)
Data <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/1ValueFootballPlayers.csv", sep = ",", header = TRUE, quote = "")
attach(Data)

## The following objects are masked from Data (pos = 3):
## 
##     Age, Age2

## The following objects are masked from mydata:
## 
##     Age, Age2

## The following objects are masked from Data (pos = 5):
## 
##     Age, Age2, Assists, Exp, Exp2, Goals, Goals2, NatTeam, Perf, Perf2,
##     Player, Value, ValueCens

y <- log(ValueCens) 
X <- cbind(1, Perf, Age, Age2, NatTeam, Goals, Exp, Exp2)
k <- dim(X)[2]
N <- dim(X)[1]
# Hyperparameters
d0 <- 0.001; a0 <- 0.001
b0 <- rep(0, k); c0 <- 1000; B0 <- c0*diag(k)
B0i <- solve(B0)
# MCMC parameters
mcmc <- 50000
burnin <- 10000
tot <- mcmc + burnin
thin <- 1
# Posterior distributions using packages: MCMCpack sets the model in terms of the precision matrix
posterior  <- MCMCpack::MCMCtobit(y~X-1, b0=b0, B0 = B0i, c0 = a0, d0 = d0, burnin = burnin, mcmc = mcmc, thin = thin, below = 13.82, above = Inf)
summary(coda::mcmc(posterior))

## 
## Iterations = 1:50000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 50000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##               Mean       SD  Naive SE Time-series SE
## X         1.049702 2.639780 1.181e-02      1.676e-02
## XPerf     0.033979 0.004512 2.018e-05      2.288e-05
## XAge      1.024758 0.213319 9.540e-04      1.337e-03
## XAge2    -0.021861 0.004002 1.790e-05      2.546e-05
## XNatTeam  0.847758 0.126113 5.640e-04      6.463e-04
## XGoals    0.010091 0.001649 7.376e-06      7.688e-06
## XExp      0.174196 0.070237 3.141e-04      3.808e-04
## XExp2    -0.005652 0.002977 1.331e-05      1.555e-05
## sigma2    0.982768 0.095944 4.291e-04      6.698e-04
## 
## 2. Quantiles for each variable:
## 
##               2.5%       25%       50%       75%      97.5%
## X        -4.192313 -0.713326  1.071600  2.841806  6.1771087
## XPerf     0.025131  0.030948  0.033967  0.036976  0.0428691
## XAge      0.609991  0.880679  1.023709  1.167147  1.4480407
## XAge2    -0.029803 -0.024542 -0.021820 -0.019162 -0.0141026
## XNatTeam  0.603262  0.762408  0.847351  0.932705  1.0950039
## XGoals    0.006877  0.008975  0.010094  0.011193  0.0133502
## XExp      0.037752  0.126722  0.173706  0.221290  0.3127266
## XExp2    -0.011525 -0.007632 -0.005642 -0.003657  0.0001633
## sigma2    0.811752  0.915417  0.976781  1.042729  1.1887701

# Gibbs sampling functions
XtX <- t(X)%*%X
PostBeta <- function(Yl, sig2){
    Bn <- solve(B0i + sig2^(-1)*XtX)
    bn <- Bn%*%(B0i%*%b0 + sig2^(-1)*t(X)%*%Yl)
    Beta <- MASS::mvrnorm(1, bn, Bn)
    return(Beta)
}
PostYl <- function(Beta, L, U, i){
    Ylmean <- X[i,]%*%Beta
    if(y[i] == L){
        Yli <- truncnorm::rtruncnorm(1, a = -Inf, b = L, mean = Ylmean, sd = sig2^0.5)
    }else{
        if(y[i] == U){
            Yli <- truncnorm::rtruncnorm(1, a = U, b = Inf, mean = Ylmean, sd = sig2^0.5)
        }else{
            Yli <- y[i]
        }
    }
    return(Yli)
}
PostSig2 <- function(Beta, Yl){
    an <- a0 + length(y)
    dn <- d0 + t(Yl - X%*%Beta)%*%(Yl - X%*%Beta)
    sig2 <- invgamma::rinvgamma(1, shape = an/2, rate = dn/2)
    return(sig2)
}
PostBetas <- matrix(0, mcmc+burnin, k); Beta <- rep(0, k)
PostSigma2 <- rep(0, mcmc+burnin); sig2 <- 1
L <- log(1000000); U <- Inf
# create progress bar in case that you want to see iterations progress
pb <- winProgressBar(title = "progress bar", min = 0, max = tot, width = 300)
for(s in 1:tot){
    Yl <- sapply(1:N, function(i){PostYl(Beta = Beta, L = L, U = U, i)})
    Beta <- PostBeta(Yl = Yl, sig2)
    sig2 <- PostSig2(Beta = Beta, Yl = Yl) 
    PostBetas[s,] <- Beta; PostSigma2[s] <- sig2
    setWinProgressBar(pb, s, title=paste( round(s/tot*100, 0), "% done"))
}
close(pb)

## NULL

keep <- seq((burnin+1), tot, thin)
PosteriorBetas <- PostBetas[keep,]
colnames(PosteriorBetas) <- c("Intercept", "Perf", "Age", "Age2", "NatTeam", "Goals", "Exp", "Exp2")
summary(coda::mcmc(PosteriorBetas))

## 
## Iterations = 1:50000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 50000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                Mean       SD  Naive SE Time-series SE
## Intercept  1.023783 2.640151 1.181e-02      1.649e-02
## Perf       0.033982 0.004496 2.011e-05      2.198e-05
## Age        1.026527 0.213505 9.548e-04      1.322e-03
## Age2      -0.021902 0.004005 1.791e-05      2.520e-05
## NatTeam    0.849435 0.125673 5.620e-04      6.394e-04
## Goals      0.010093 0.001652 7.390e-06      7.745e-06
## Exp        0.175137 0.070498 3.153e-04      3.877e-04
## Exp2      -0.005685 0.002985 1.335e-05      1.578e-05
## 
## 2. Quantiles for each variable:
## 
##                2.5%       25%       50%       75%      97.5%
## Intercept -4.218202 -0.737377  1.048790  2.818320  6.1265295
## Perf       0.025253  0.030947  0.033956  0.036986  0.0428926
## Age        0.612543  0.881473  1.024929  1.169122  1.4478207
## Age2      -0.029837 -0.024583 -0.021874 -0.019170 -0.0141402
## NatTeam    0.603543  0.764706  0.849179  0.933046  1.0994449
## Goals      0.006808  0.008979  0.010097  0.011208  0.0133334
## Exp        0.038449  0.127461  0.174294  0.223045  0.3138970
## Exp2      -0.011563 -0.007695 -0.005659 -0.003663  0.0001001

summary(coda::mcmc(PostSigma2[keep]))

## 
## Iterations = 1:50000
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 50000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##           Mean             SD       Naive SE Time-series SE 
##      0.9860750      0.0963754      0.0004310      0.0006757 
## 
## 2. Quantiles for each variable:
## 
##   2.5%    25%    50%    75%  97.5% 
## 0.8156 0.9186 0.9792 1.0469 1.1937

References

Chib, S. 1992. “Bayes Inference in the Tobit Censored Regression Model.” Journal of Econometrics 51: 79–99.

We can set \(L\) or \(U\) equal to \(-\infty\) or \(\infty\) to model data censored on just one side.↩︎