6.8 Tobit model
The dependent variable is partially observed in Tobit models due to sampling schemes, whereas the regressors are completely observed. In particular,
\[\begin{equation} y_i = \begin{Bmatrix} L, & \quad y_i^* < L, \\ y_i^*, & \quad L \leq y_i^* < U, \\ U, & \quad y_i^* \geq U, \end{Bmatrix} \end{equation}\]
where \(y_i^* \stackrel{i.i.d.}{\sim} N(\mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2)\), \(\mathbf{x}_i\) is a \(K\)-dimensional vector of regressors.30
We use conjugate independent priors \(\boldsymbol{\beta} \sim N(\boldsymbol{\beta}_0, \mathbf{B}_0)\) and \(\sigma^2 \sim IG(\alpha_0/2, \delta_0/2)\), and data augmentation using \(\mathbf{y}^*_C\) such that \(y_{C_i}^* \stackrel{i.n.d.}{\thicksim} N(\mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2)\), \(y_{C_i} = \left\{ y_{C_i^L}^* \cup y_{C_i^U}^* \right\}\) are lower and upper censored data. This allows implementing the Gibbs sampling algorithm (S. Chib 1992).
Then,
\[\begin{align} \pi(\boldsymbol{\beta}, \sigma^2, \mathbf{y^*} \mid \mathbf{y}, \mathbf{X}) &\propto \prod_{i=1}^N \left[ \mathbb{1}({y_i = L}) \mathbb{1}({y_{C_i^L}^* < L}) + \mathbb{1}({L \leq y_i < U}) + \mathbb{1}({y_i = U}) \mathbb{1}({y_{C_i^U}^* \geq U}) \right] \\ &\times N(y_i^* \mid \mathbf{x}_i^{\top} \boldsymbol{\beta}, \sigma^2) \times N(\boldsymbol{\beta} \mid \boldsymbol{\beta}_0, \mathbf{B}_0) \times IG(\sigma^2 \mid \alpha_0/2, \delta_0/2) \end{align}\]
The posterior distributions are:
\[\begin{equation} y_{C_i}^* \mid \boldsymbol{\beta}, \sigma^2, \mathbf{y}, \mathbf{X} \sim \begin{Bmatrix} TN_{(-\infty,L)}(\mathbf{x}_i^{\top}\boldsymbol{\beta}, \sigma^2) \ , \ y_i = L \\ TN_{[U, \infty)}(\mathbf{x}_i^{\top}\boldsymbol{\beta}, \sigma^2) \ \ , \ y_i = U \end{Bmatrix} \end{equation}\]
\[\begin{equation} \boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim N(\boldsymbol{\beta}_n, \sigma^2 \mathbf{B}_n) \end{equation}\]
\[\begin{equation} \sigma^2 \mid \boldsymbol{\beta}, \mathbf{y}, \mathbf{X} \sim IG(\alpha_n/2, \delta_n/2) \end{equation}\]
where \(\mathbf{B}_n = (\mathbf{B}_0^{-1} + \sigma^{-2} \mathbf{X}^{\top} \mathbf{X})^{-1}\), \(\boldsymbol{\beta}_n = \mathbf{B}_n(\mathbf{B}_0^{-1} \boldsymbol{\beta}_0 + \sigma^{-2} \mathbf{X}^{\top} \mathbf{y}^*)\), \(\alpha_n = \alpha_0 + N\), and \(\delta_n = \delta_0 + (\mathbf{y}^* - \mathbf{X} \boldsymbol{\beta})^{\top}(\mathbf{y}^* - \mathbf{X} \boldsymbol{\beta})\).
Example: The market value of soccer players in Europe continues
We continue with the example of the market value of soccer players from Section 6.1. We specify the same equation but assume that the sample is censored from below, such that we only have information for soccer players whose market value is higher than one million euros. The dependent variable is log(ValueCens), and the left censoring point is 13.82.
The following Algorithm illustrates how to estimate Tobit models in our GUI. Our GUI utilizes the MCMCtobit command from the MCMCpack package.
Algorithm: Tobit models
- Select Univariate Models on the top panel
- Select Tobit model using the left radio button
- Upload the dataset by first selecting whether there is a header in the file, and the kind of separator in the csv file of the dataset (comma, semicolon, or tab). Then, use the Browse button under the Choose File legend. You should see a preview of the dataset
- Select MCMC iterations, burn-in, and thinning parameters using the Range sliders
- Select dependent and independent variables using the Formula builder table
- Click the Build formula button to generate the formula in R syntax. You can modify the formula in the Main equation box using valid arguments of the formula command structure in R
- Set the left and right censoring points. To censor above only, specify -Inf in the left censoring box, and to censor below only, specify Inf in the right censoring box
- Set the hyperparameters: mean vector, covariance matrix, shape, and scale parameters. This step is not necessary as by default our GUI uses non-informative priors
- Click the Go! button
- Analyze results
- Download posterior chains and diagnostic plots using the Download Posterior Chains and Download Posterior Graphs buttons
We run this application using the same hyperparameters that we set in the example of Section 6.1. All results seem similar to those in the example of linear models. In addition, the posterior chains seem to achieve good diagnostics.
rm(list = ls()); set.seed(010101)
Data <- read.csv("https://raw.githubusercontent.com/besmarter/BSTApp/refs/heads/master/DataApp/1ValueFootballPlayers.csv", sep = ",", header = TRUE, quote = "")
attach(Data)
## The following objects are masked from Data (pos = 3):
##
## Age, Age2
## The following objects are masked from mydata:
##
## Age, Age2
## The following objects are masked from Data (pos = 5):
##
## Age, Age2, Assists, Exp, Exp2, Goals, Goals2, NatTeam, Perf, Perf2,
## Player, Value, ValueCens
y <- log(ValueCens)
X <- cbind(1, Perf, Age, Age2, NatTeam, Goals, Exp, Exp2)
k <- dim(X)[2]
N <- dim(X)[1]
# Hyperparameters
d0 <- 0.001; a0 <- 0.001
b0 <- rep(0, k); c0 <- 1000; B0 <- c0*diag(k)
B0i <- solve(B0)
# MCMC parameters
mcmc <- 50000
burnin <- 10000
tot <- mcmc + burnin
thin <- 1
# Posterior distributions using packages: MCMCpack sets the model in terms of the precision matrix
posterior <- MCMCpack::MCMCtobit(y~X-1, b0=b0, B0 = B0i, c0 = a0, d0 = d0, burnin = burnin, mcmc = mcmc, thin = thin, below = 13.82, above = Inf)
summary(coda::mcmc(posterior))
##
## Iterations = 1:50000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 50000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## X 1.049702 2.639780 1.181e-02 1.676e-02
## XPerf 0.033979 0.004512 2.018e-05 2.288e-05
## XAge 1.024758 0.213319 9.540e-04 1.337e-03
## XAge2 -0.021861 0.004002 1.790e-05 2.546e-05
## XNatTeam 0.847758 0.126113 5.640e-04 6.463e-04
## XGoals 0.010091 0.001649 7.376e-06 7.688e-06
## XExp 0.174196 0.070237 3.141e-04 3.808e-04
## XExp2 -0.005652 0.002977 1.331e-05 1.555e-05
## sigma2 0.982768 0.095944 4.291e-04 6.698e-04
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## X -4.192313 -0.713326 1.071600 2.841806 6.1771087
## XPerf 0.025131 0.030948 0.033967 0.036976 0.0428691
## XAge 0.609991 0.880679 1.023709 1.167147 1.4480407
## XAge2 -0.029803 -0.024542 -0.021820 -0.019162 -0.0141026
## XNatTeam 0.603262 0.762408 0.847351 0.932705 1.0950039
## XGoals 0.006877 0.008975 0.010094 0.011193 0.0133502
## XExp 0.037752 0.126722 0.173706 0.221290 0.3127266
## XExp2 -0.011525 -0.007632 -0.005642 -0.003657 0.0001633
## sigma2 0.811752 0.915417 0.976781 1.042729 1.1887701
# Gibbs sampling functions
XtX <- t(X)%*%X
PostBeta <- function(Yl, sig2){
Bn <- solve(B0i + sig2^(-1)*XtX)
bn <- Bn%*%(B0i%*%b0 + sig2^(-1)*t(X)%*%Yl)
Beta <- MASS::mvrnorm(1, bn, Bn)
return(Beta)
}
PostYl <- function(Beta, L, U, i){
Ylmean <- X[i,]%*%Beta
if(y[i] == L){
Yli <- truncnorm::rtruncnorm(1, a = -Inf, b = L, mean = Ylmean, sd = sig2^0.5)
}else{
if(y[i] == U){
Yli <- truncnorm::rtruncnorm(1, a = U, b = Inf, mean = Ylmean, sd = sig2^0.5)
}else{
Yli <- y[i]
}
}
return(Yli)
}
PostSig2 <- function(Beta, Yl){
an <- a0 + length(y)
dn <- d0 + t(Yl - X%*%Beta)%*%(Yl - X%*%Beta)
sig2 <- invgamma::rinvgamma(1, shape = an/2, rate = dn/2)
return(sig2)
}
PostBetas <- matrix(0, mcmc+burnin, k); Beta <- rep(0, k)
PostSigma2 <- rep(0, mcmc+burnin); sig2 <- 1
L <- log(1000000); U <- Inf
# create progress bar in case that you want to see iterations progress
pb <- winProgressBar(title = "progress bar", min = 0, max = tot, width = 300)
for(s in 1:tot){
Yl <- sapply(1:N, function(i){PostYl(Beta = Beta, L = L, U = U, i)})
Beta <- PostBeta(Yl = Yl, sig2)
sig2 <- PostSig2(Beta = Beta, Yl = Yl)
PostBetas[s,] <- Beta; PostSigma2[s] <- sig2
setWinProgressBar(pb, s, title=paste( round(s/tot*100, 0), "% done"))
}
close(pb)
## NULL
keep <- seq((burnin+1), tot, thin)
PosteriorBetas <- PostBetas[keep,]
colnames(PosteriorBetas) <- c("Intercept", "Perf", "Age", "Age2", "NatTeam", "Goals", "Exp", "Exp2")
summary(coda::mcmc(PosteriorBetas))
##
## Iterations = 1:50000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 50000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## Intercept 1.023783 2.640151 1.181e-02 1.649e-02
## Perf 0.033982 0.004496 2.011e-05 2.198e-05
## Age 1.026527 0.213505 9.548e-04 1.322e-03
## Age2 -0.021902 0.004005 1.791e-05 2.520e-05
## NatTeam 0.849435 0.125673 5.620e-04 6.394e-04
## Goals 0.010093 0.001652 7.390e-06 7.745e-06
## Exp 0.175137 0.070498 3.153e-04 3.877e-04
## Exp2 -0.005685 0.002985 1.335e-05 1.578e-05
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## Intercept -4.218202 -0.737377 1.048790 2.818320 6.1265295
## Perf 0.025253 0.030947 0.033956 0.036986 0.0428926
## Age 0.612543 0.881473 1.024929 1.169122 1.4478207
## Age2 -0.029837 -0.024583 -0.021874 -0.019170 -0.0141402
## NatTeam 0.603543 0.764706 0.849179 0.933046 1.0994449
## Goals 0.006808 0.008979 0.010097 0.011208 0.0133334
## Exp 0.038449 0.127461 0.174294 0.223045 0.3138970
## Exp2 -0.011563 -0.007695 -0.005659 -0.003663 0.0001001
##
## Iterations = 1:50000
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 50000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## 0.9860750 0.0963754 0.0004310 0.0006757
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## 0.8156 0.9186 0.9792 1.0469 1.1937
References
We can set \(L\) or \(U\) equal to \(-\infty\) or \(\infty\) to model data censored on just one side.↩︎