Estimating Selection Models

Author

Christopher P Adams

Introduction

Chapter 5 introduced the idea of using the economic assumption of revealed preference for estimating policy effects. Berkeley econometrician Daniel McFadden won the Nobel prize in economics for his work using revealed preference to estimate demand. McFadden was joined in the Nobel prize by University of Chicago econometrician James Heckman. Heckman won for his work extending **revealed preference** to a broader range of problems.

Chapter 6 considers two related problems, censoring and selection. Censoring occurs when the value of a variable is limited due to some constraint. For example, we tend not to see wages below the federal minimum wage. The chapter shows our estimates can be biased when our statistical models expect the variable to go below the censored level. A standard method to account for censoring is to combine a probit with OLS. This combined model is called a Tobit. The chapter estimates a wage regression similar to Chapter 2’s analysis of returns to schooling. The difference is that here, the regression accounts for censoring of wages at the minimum wage.

The selection problem is a generalization of the censoring problem. The data is censored due to some sort of “choice.”¹ While McFadden considered problems where the choice was observed but the outcomes were not, Heckman examined a question where both the choice and the outcome of that choice are observed.

The chapter uses Heckman’s model to analyze the gender-wage gap. The concern is that the observed difference in wages by gender may underestimate the actual difference. Traditionally, many women did not have paid work because they had childcare or other uncompensated responsibilities. Whether or not a woman works full-time depends on the wage she is offered. We only observe the offers that were accepted, which means the offers are “selected.” The Heckman model allows us to account for the “choice” of these women to work.

In addition to analyzing the gender wage gap, the chapter returns to the question of measuring returns to schooling. The chapter uses a version of the Heckman selection model to estimate the joint distribution of potential wages for attending college and not attending college.

Modeling Censored Data

Censoring refers to the issue that a variable is set to an arbitrary value, such as 0, whenever it would otherwise fall beyond some limit. Say, for example, that a variable can never be negative. When we look at hours worked, the values are all zero or positive, because the minimum number of hours a person can work is zero. Such restrictions on the values can make it difficult to use OLS and other methods described in the previous chapters.

The section presents the latent value model and the Tobit estimator.

A Model of Censored Data

Consider a model eerily similar to the model presented in the previous chapter. There is some latent outcome ($y_i^*$), where if this value is large enough we observe $y_i = y_i^*$. The variable $y^*$ could be estimated with a standard OLS model. However, $y^*$ is not fully observed in the data. Instead $y$ is observed, where $y$ is equal to $y^*$ when that latent variable is above some threshold. Otherwise, the variable is equal to the threshold. Here the variable is censored at 0.

\[ \begin{array}{l} y_i^* = a + b x_i + \upsilon_i\\ \\ y_i = \left \{\begin{array}{l} y_i^* \mbox{ if } y_i^* \ge 0\\ 0 \mbox{ if } y_i^* < 0 \end{array} \right. \end{array} \tag{1}\]

We can think of $y_i^*$ as representing hours worked like in the example above.

Simulation of Censored Data

Consider a simulated version of the model presented above.

set.seed(123456789)  
N <- 500
a <- 2
b <- -3
x <- runif(N)
u <- rnorm(N)
ystar <- a + b*x + u
y <- ifelse(ystar > 0, ystar, 0)
lm1 <- lm(y ~ x)
lm1$coefficients[2]

        x 
-2.026422

Figure 1: Plot of $x$ and $y$ with the relationship between $y^*$ and $x$ represented by the solid line. The estimated relationship between $x$ and $y$ is represented by the dashed line.

Figure 1 shows that the relationship estimated with OLS is quite different from the true relationship. The true relationship has a slope of -3, while the estimated relationship is much flatter with a slope of -2. Can you see the problem?

OLS does not provide a correct estimate of the relationship because the data is censored. Only positive values of $y$ are observed in the data, but the true relationship implies there would in fact be negative values.
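To see where the bias comes from, the following minimal sketch compares the average of censored draws with the textbook formula for the mean of a normal variable censored from below at zero, $E[y | x] = \Phi(m/\sigma) m + \sigma \phi(m/\sigma)$ with $m = a + b x$. The formula is a standard result that is assumed here rather than derived in the chapter, and the names `x0`, `m` and `censored_mean` are purely illustrative.

```r
set.seed(1)
a <- 2; b <- -3; sigma <- 1
x0 <- 0.8                      # a value of x where censoring is common
m <- a + b*x0                  # latent mean at x0, about -0.4
# simulate the censored outcome at x = x0
ystar <- m + rnorm(1e6, sd = sigma)
y <- pmax(ystar, 0)
# closed-form mean of a normal censored from below at 0
censored_mean <- pnorm(m/sigma)*m + sigma*dnorm(m/sigma)
c(latent = m, simulated = mean(y), formula = censored_mean)
```

The censored mean sits above the latent mean, and the gap widens as $a + b x$ falls. That is consistent with the OLS line on the censored data being flatter than the true line.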

The implication is that our method of averaging, discussed in Chapter 1, no longer works. One solution is to limit the data so that it is not censored. Figure 1 suggests that for values of $x$ below 0.6, the data is mostly not censored. Restricting the sample in this way reduces the bias, but at the cost of discarding about 40% of the data. The term biased refers to whether we expect the estimated value to be equal to the true value. If it is biased, then we don’t expect them to be equal. If it is unbiased we do expect them to be equal.² We generally use the term efficient to refer to the likely variation in our estimate due to sampling error. Limiting the data makes our estimate less biased and less efficient.

lm(y[x < 0.6] ~ x[x < 0.6])$coef[2]

x[x < 0.6] 
 -2.993644

length(y[x < 0.6])

[1] 291

In this case, limiting the sample to the data that is not censored leads to an estimate close to the true value of -3.
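A single draw cannot separate bias from sampling noise, so a small Monte Carlo experiment may help. The sketch below repeats the simulation 200 times and compares the OLS slope on the full censored sample with the slope on the sample restricted to $x < 0.6$. The helper `sim_once` and the number of replications are illustrative choices, not part of the original analysis.

```r
set.seed(2)
sim_once <- function(N = 500, a = 2, b = -3) {
  x <- runif(N)
  y <- pmax(a + b*x + rnorm(N), 0)   # censor the latent outcome at 0
  keep <- x < 0.6
  c(full = unname(coef(lm(y ~ x))[2]),
    restricted = unname(coef(lm(y[keep] ~ x[keep]))[2]))
}
sims <- replicate(200, sim_once())   # 2 x 200 matrix of slope estimates
rowMeans(sims)                       # average slope by method
apply(sims, 1, sd)                   # sampling variation by method
```

In runs of this sketch the restricted estimator should be closer to -3 on average but noticeably more variable. It is not exactly unbiased, because some observations with $x$ below 0.6 are still censored.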

Another solution is to ignore the exact amount of positive values and estimate a probit. If we simplify the problem by setting all positive values to 1 we can use a standard probit. Again, this estimate is not efficient: we have thrown away information about the value of $y$ when it is not censored.

glm(y > 0 ~ x, family = binomial(link = "probit"))$coefficients

(Intercept)           x 
   2.015881   -3.013131

Again, despite throwing away information, the probit gives results that are pretty close to the true values of 2 and -3.

The solution presented below is to use a probit to account for the censoring and estimate OLS on the non-censored data. In particular, the Tobit is a maximum likelihood estimator that allows the two methods to be combined in a natural way. The estimator also uses all the information and so is more efficient than the solutions presented above.

Latent Value Model

Figure 2: Histogram of the observed values of $y$.

One way to correct our estimate is to determine what the censored values of $y^*$ are. At the very least, we need to determine the distribution of the latent values. Figure 2 presents the histogram of the observed values of $y$. While we observe the uncensored part of the distribution of $y^*$, we have no idea what the censored part of the distribution looks like.

However, we may be willing to make an assumption about the shape of the distribution. In that case, it may be possible to estimate the distribution of the missing data using information from the data that is not censored.

The latent value model is very similar to the demand model presented in the previous chapter. In both cases, there is some **latent** value that we are interested in measuring. We can write out the data generating process.

\[ \begin{array}{l} y_i^* = \mathbf{X}_i' \beta + \upsilon_{i}\\ y_i = \left \{ \begin{array}{l} y_i^* \mbox{ if } y_i^* \ge 0\\ 0 \mbox{ if } y_i^* < 0 \end{array} \right . \end{array} \tag{2}\]

where $y_i$ is the outcome of interest observed for individual $i$, $y_i^*$ is the latent outcome of interest, $\mathbf{X}_i$ are observed characteristics that may both determine the latent outcome and the probability of the outcome being censored, $\beta$ is a vector of parameters, and $\upsilon_i$ represents the unobserved characteristics. In the demand model, the parameters represented each individual’s preferences. In the case analyzed below, the latent outcome is the wage rate and the observed characteristics are the standard variables related to observed wages such as education, age and race.

If the unobserved characteristic of the individual is high enough, then the outcome is not censored. In that case, we have the OLS model. As shown in the previous chapter, if the unobserved term is distributed normally, $\upsilon_i \sim \mathcal{N}(0, \sigma^2)$, then we have the following likelihood of observing the data.

\[ L(y_i, \mathbf{X}_i | y_i > 0, \beta, \sigma) = \frac{1}{\sigma} \phi(z_i) \tag{3}\]

where

\[ z_i = \frac{y_i - \mathbf{X}_i' \beta}{\sigma} \tag{4}\]

Here the standard normal density is denoted by $\phi$. We need to remember that the density is the derivative of the distribution function with respect to $y_i$, so there is an extra $\frac{1}{\sigma}$ term.

For the alternative case, we can use the probit model.

\[ L(y_i, \mathbf{X}_i | y_i=0, \beta) = \Phi(-\mathbf{X}_i' \beta) \tag{5}\]

The censored model with the normality assumption is called a Tobit. The great econometrician, Art Goldberger, named it for the great economist, James Tobin, and the great limited dependent variable model, the probit (Enami and Mullahy 2009).³

Tobit Estimator

We can write out the Tobit estimator by combining the ideas from the maximum likelihood estimator of OLS and the probit presented in Chapter 5.

That is, find the parameters ($\beta$ and $\sigma$) that maximize the probability that the model predicts the observed data.

\[ \begin{array}{l} \max_{\beta, \sigma} \sum_{i=1}^N \mathbb{1}(y_i = 0)\log (\Phi(z_i)) + \mathbb{1}(y_i > 0)(\log(\phi(z_i)) - \log(\sigma))\\ \mbox{s.t. } z_i = \frac{y_i -\mathbf{X}_i' \beta}{\sigma} \end{array} \tag{6}\]

The notation $\mathbb{1}()$ is an indicator function. It is equal to 1 if what is inside the parentheses is true, 0 if false. As stated in Chapter 5, it is better for the computer to maximize the log-likelihood rather than the likelihood. In addition, I made a slight change to the “probit” part which makes the description closer to the estimator in R.
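One practical reason to work on the log scale is numerical underflow. The short sketch below shows that taking the log of a tiny normal probability can return negative infinity, while the `log.p` and `log` arguments of base R's `pnorm()` and `dnorm()` compute the same quantities directly on the log scale. The value of `z` is an arbitrary illustration.

```r
z <- -40
log(pnorm(z))             # pnorm underflows to 0, so the log is -Inf
pnorm(z, log.p = TRUE)    # computed on the log scale, stays finite
dnorm(z, log = TRUE)      # log density, also finite
```

Such extreme values are unlikely near sensible starting values, but an optimizer can wander into them, so the log-scale arguments are a cheap safeguard.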

Unlike the probit model we have an extra term for the distribution of the unobserved term ($\sigma$). This parameter is not identified in a discrete choice model but it is identified in a Tobit model. We need to be careful that we correctly include the effect of this parameter in the log likelihood function.

Tobit Estimator in `R`

We can use Equation 6 as pseudo-code for the estimator in R.

f_tobit <- function(para, y, X) {
  X <- cbind(1,X)
  sig <- exp(para[1]) # use exp() to keep value positive.
  beta <- para[2:length(para)]
  is0 <- y == 0
  # indicator function for y = 0.
  z <- (y - X%*%beta)/sig
  log_lik <- -sum(is0*log(pnorm(z)) + 
                    (1 - is0)*(log(dnorm(z)) - log(sig)))
  # note the negative because we are minimizing.
  return(log_lik)
}

par1 <- c(0,lm1$coefficients)
a1 <- optim(par=par1,fn=f_tobit,y=y,X=x)
exp(a1$par[1])

          
0.9980347

a1$par[2:length(a1$par)]

(Intercept)           x 
   2.010674   -3.012778

The model estimates the three parameters accurately. The scale parameter estimate is $\hat{\sigma} = 1.00$ compared to a true value of 1, while $\hat{\beta}$ is 2.01 and -3.01, which is very close to the true values of 2 and -3. To be clear, in the simulated data the unobserved term is in fact normally distributed, an important assumption of the Tobit model. What happens if you estimate this model with some other distribution?
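As a cross-check on a hand-written likelihood, a Tobit can also be fit as a left-censored Gaussian regression with `survreg()` from the `survival` package. This is a sketch on freshly simulated data rather than the chapter's own estimator, and it assumes `survival` is installed (it ships with standard R distributions).

```r
library(survival)
set.seed(3)
N <- 2000
x <- runif(N)
y <- pmax(2 - 3*x + rnorm(N), 0)
# left-censored at 0: the indicator is TRUE when y is observed exactly
fit <- survreg(Surv(y, y > 0, type = "left") ~ x, dist = "gaussian")
coef(fit)      # intercept and slope
fit$scale      # estimate of sigma
```

If the two approaches give noticeably different answers on the same data, that usually points to a coding error in the likelihood or to the optimizer stopping early.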

Censoring Due To Minimum Wages

One of the standard questions in labor economics is determining the effect on earnings of an individual’s characteristics, like education, age, race, and gender. For example, we may be interested in whether women are paid less than men for the same work. A concern is that our wage data may be censored. For example, in July 2009 the federal government increased the federal minimum wage to $7.25. That is, it was generally illegal to pay people less than $7.25.⁴

The section uses information on wages from 2010 and compares OLS to Tobit estimates.

National Longitudinal Survey of Youth 1997

The National Longitudinal Survey of Youth 1997 (NLSY97) is a popular data set for applied microeconometrics and labor economics. The data follows about 9,000 individuals across 18 years beginning in 1997. At the start of the data collection, the individuals were between 12 and 18 years old.⁵

x <- read.csv("NLSY97_min.csv", as.is = TRUE)
x$wage <- 
  ifelse(x$CVC_HOURS_WK_YR_ALL.10 > 0 & x$YINC.1700 > 0, 
                 x$YINC.1700/x$CVC_HOURS_WK_YR_ALL.10,NA)
x$wage <- as.numeric(x$wage)
x$wage <- ifelse(x$wage < quantile(x$wage,0.90, 
                                   na.rm = TRUE), x$wage,NA) 
# drop (set to NA) wages at or above the 90th percentile
# this is done to remove unreasonably high measures of wages.
x$lwage <- ifelse(x$wage > 7.25, log(x$wage), 0)
# note the 0s are used as an indicator in the Tobit function.
x$ed <- ifelse(x$CV_HGC_EVER_EDT>0 & 
                 x$CV_HGC_EVER_EDT < 25,x$CV_HGC_EVER_EDT,NA)
# removed very high values of education.
x$exp <- 2010 - x$KEY.BDATE_Y - x$ed - 6
x$exp2 <- (x$exp^2)/100
# division makes the reported results nicer.
x$female <- x$KEY.SEX==2
x$black <- x$KEY.RACE_ETHNICITY==1
index_na <- is.na(rowSums(cbind(x$lwage,x$wage,x$ed,x$exp,
                                x$black,x$female)))==0
x1 <- x[index_na,]

To illustrate censoring we can look at wage rates for the individuals in NLSY97. Their **average wage rate** is calculated as their total income divided by the total number of hours worked for the year. In this case, income and wages are measured in 2010. The code uses the censored variable where log wage is set to 0 unless the wage rate is above $7.25 per hour.

Figure 3: Histogram of the observed wages in 2010 from NLSY97. The vertical line is at the minimum wage of $7.25.

The histogram in Figure 3 gives some indication of the issue. We see that the distribution of wages is not symmetric and there seems to be higher than expected frequency just above the federal minimum wage. Actually, what is surprising is that there is a relatively large number of individuals with average wages below the minimum. It is unclear why that is, but it may be due to reporting errors or cases where the individual is not subject to the law.

Tobit Estimates

Even though it may provide inaccurate results, it is always useful to run OLS. Here we use it as a comparison to see if the censoring affects our measurement of how education, gender, and race affect wages. Following the argument presented above, it is useful to also see the probit estimates. The probit accounts for some of the impact of censoring although it throws away a lot of information.

lm1 <- lm(lwage ~ ed + exp + exp2 + female + black, data=x1)
glm1 <- glm(lwage > 0 ~ ed + exp + exp2 + female + black, 
             data=x1, family=binomial(link="probit"))

Comparing OLS to probit and Tobit estimates we can see how the censoring affects standard estimates of returns to experience, schooling, race and gender.

par1 <- c(0,glm1$coefficients)
a1 <- optim(par=par1,fn=f_tobit,y=x1$lwage,
            X=cbind(x1$ed,x1$exp,x1$exp2,x1$female,x1$black),
            control = list(trace=0,maxit=10000))
res_tab <- cbind(lm1$coefficients,glm1$coefficients,
                 a1$par[2:length(a1$par)])
res_tab <- rbind(res_tab,c(1,1,exp(a1$par[1])))
colnames(res_tab) <- c("OLS Est","Probit Est","Tobit Est")
rownames(res_tab) <- c("intercept","ed","exp","exp sq",
                       "female","black","sigma")

Table 1 presents a comparison of wage rate regressions between OLS, probit and the Tobit. Note that the values for $\sigma$ are arbitrarily set to 1 for the OLS and probit. The Tobit estimates suggest that being female and being black lead to lower wages. The Tobit coefficients on female (-0.19) and black (-0.33) are larger in magnitude than the OLS coefficients (-0.17 and -0.28), which suggests that ignoring censoring understates these effects.

Table 1: OLS, probit and Tobit estimates on 2010 wage rates.

	OLS Est	Probit Est	Tobit Est
intercept	-0.1285217	-1.5149073	-0.6399135
ed	0.1359129	0.1448531	0.1633185
exp	0.1173622	0.1223768	0.1209332
exp sq	-0.4039422	-0.4198480	-0.3505662
female	-0.1681297	-0.1956350	-0.1920443
black	-0.2836382	-0.3134971	-0.3339733
sigma	1.0000000	1.0000000	1.2218445
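The table reports point estimates only. One common way to get approximate standard errors from a maximum likelihood fit with `optim()` is to request the Hessian and invert it. The sketch below does this on simulated data, since the NLSY97 file is not reproduced here. The function `nll_tobit` is an illustrative rewrite of the Tobit likelihood for a single regressor, and the switch to the BFGS method is an assumption of this sketch.

```r
set.seed(4)
N <- 1000
x <- runif(N)
y <- pmax(2 - 3*x + rnorm(N), 0)
# Tobit negative log-likelihood, par = (log sigma, intercept, slope)
nll_tobit <- function(par, y, x) {
  sig <- exp(par[1])
  z <- (y - par[2] - par[3]*x)/sig
  is0 <- y == 0
  -(sum(pnorm(z[is0], log.p = TRUE)) +
      sum(dnorm(z[!is0], log = TRUE) - log(sig)))
}
fit <- optim(c(0, coef(lm(y ~ x))), nll_tobit, y = y, x = x,
             method = "BFGS", hessian = TRUE)
# the inverse Hessian of the negative log-likelihood approximates the
# variance of the estimates (the first element refers to log sigma)
se <- sqrt(diag(solve(fit$hessian)))
cbind(estimate = fit$par, std_error = se)
```

These standard errors rely on the model being correctly specified, so they are best read as a rough guide to precision.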

Modeling Selected Data

The Tobit model is about censoring. A close cousin of censoring is selection. In both cases we can think of the problem as having missing data. The difference is the reason for the missingness. In censoring, the data is missing because the outcome variable of interest is above or below some threshold. With selection, the data is missing because the individuals in the data have made a choice or have had some choice made for them.

Consider the problem of estimating returns to schooling for women. Compared to males, a large share of the female population don’t earn wages. We have a selection problem if this choice is determined by how much these women would have earned if they had chosen to work. The observed distribution of wages for women may be systematically different than the unobserved distribution of wage offers. This difference may lead us to underestimate the gender wage gap.

A Selection Model

Consider a model similar to the one presented above. There is some latent outcome ($y_i^*$). This time, however, $y_i = y_i^*$ only if an index based on some other variable $z_i$ is large enough.

\[ \begin{array}{l} y_i^* = a + b x_i + \upsilon_{2i}\\ \\ y_i = \left \{\begin{array}{l} y_i^* \mbox{ if } c + d z_i + \upsilon_{1i} \ge 0\\ 0 \mbox{ if } c + d z_i + \upsilon_{1i} < 0 \end{array} \right. \end{array} \tag{7}\]

We can think of $y_i^*$ as representing hours worked, where the individual decides to work full-time based on factors such as child care cost or availability.

Note if $z_i = x_i$, $a = c$, $b = d$ and $\upsilon_{1i} = \upsilon_{2i}$ then this is exactly the same model as presented above.

Simulation of a Selection Model

Consider a simulated data set that is similar to the data created for the Tobit model. One difference is the $z$ variable. Previously, whether or not $y$ was observed depended on $x$ and the outcome's unobserved term $u_2$. Here it depends on $z$ and $u_1$, where $u_1$ is correlated with $u_2$. Importantly, neither $z$ nor $u_1$ directly determines the observed value of $y$. In this data, the $z$ variable is determining whether or not the $y$ variable is censored.

require(mvtnorm)

set.seed(123456789)  
N <- 100
a <- 6
b <- -3
c <- 4
d <- -5
x <- runif(N)
z <- runif(N)
mu <- c(0,0)
sig <- rbind(c(1,0.5),c(0.5,1))
u <- rmvnorm(N, mean=mu, sigma=sig)
# creates a matrix with two correlated random variables.
y <- ifelse(c + d*z + u[,1] > 0, a + b*x + u[,2], 0)
x1 <- x
y1 <- y
lm1 <- lm(y1 ~ x1)
x1 <- x[z < 0.6]
y1 <- y[z < 0.6]
lm2 <- lm(y1 ~ x1)
y1 <- y > 0
glm1 <- glm(y1 ~ z, family = binomial(link = "probit"))

Table 2: Estimates on the simulated data where Model (1) is an OLS model, Model (2) is an OLS model on data where $z < 0.6$ and Model (3) is a probit.

	(1)	(2)	(3)
(Intercept)	4.397	6.039	4.025
	(0.426)	(0.301)	(0.869)
x1	-1.314	-3.358
	(0.741)	(0.517)
z			-4.905
			(1.150)
Num.Obs.	100	64	100
R2	0.031	0.405
R2 Adj.	0.021	0.395
AIC	440.9	205.6	58.2
BIC	448.7	212.0	63.4
Log.Lik.	-217.458	-99.777	-27.112
RMSE	2.13	1.15	0.30

Table 2 presents OLS and probit regression results for the simulated data. Models (1) and (2) regress the outcome variable $y$ on $x$, while Model (3) is a probit of whether $y$ is observed on $z$. Note that in the simulation $z$ only affects whether or not $y$ is censored, while $x$ affects the outcome itself. The results show that simply regressing $y$ on $x$ will give biased results. Model (1) is the standard OLS model with just the $x$ variable. The estimated slope of -1.31 is far from the true value of -3. This is due to the censoring, similar to the argument made above. Again, we can restrict the data to observations less likely to be impacted by the censoring. Model (2) does this by restricting analysis to observations with small values of $z$. This restriction gives estimates (6.04 and -3.36) that are pretty close to the true values of 6 and -3. The probit model (Model (3)) accounts for the impact of $z$ on whether $y$ is observed. Its estimates (4.03 and -4.91) are pretty close to the true values of 4 and -5.

Heckman Model

In algebra, the selection model is similar to the Tobit model.

\[ y_i = \left \{ \begin{array}{l} \mathbf{X}_i' \beta + \upsilon_{2i} \mbox{ if } \upsilon_{1i} \ge -\mathbf{Z}_i' \gamma\\ 0 \mbox{ if } \upsilon_{1i} < -\mathbf{Z}_i' \gamma \end{array} \right . \tag{8}\]

where $\{\upsilon_{1i}, \upsilon_{2i}\} \sim \mathcal{N}(\mu, \Sigma)$, $\mu = \{0, 0\}$ and

\[ \Sigma = \left[\begin{array}{cc} 1 & \rho\\ \rho & 1 \end{array}\right] \tag{9}\]

In the Heckman model the “decision” to work or not is dependent on a different set of observed and unobserved characteristics, represented by $\mathbf{Z}$ and $\upsilon_1$. Note that these could be exactly the same as in the Tobit. It may be that $\mathbf{X} = \mathbf{Z}$ and $\rho = 1$. That is, the Heckman model is a generalization of the Tobit model. In the example of women in the workforce, the Heckman model allows things like childcare costs to determine whether women are observed earning wages. It also allows unobserved characteristics that affect the probability of being in the workforce to also affect the wages earned.

Heckman Estimator

As with the Tobit model, we can use maximum likelihood to estimate the model. The likelihood of observing the censored value $y_i = 0$ is determined by a probit.

\[ L(y_i, \mathbf{Z}_i | y_i=0, \gamma) = \Phi(-\mathbf{Z}_i' \gamma) \tag{10}\]

The likelihood of observing a censored value of $y$ is determined by the $Z$s and the vector $\gamma$. Note again, the variance of the normal distribution is not identified in the probit model and so is set to 1.

The likelihood of a strictly positive value of $y_i$ is more complicated than for the censored value. If the unobserved terms determining whether $y$ is censored ($\upsilon_{1}$) and its value ($\upsilon_{2}$) are independent then this likelihood would be as for the Tobit. However, the two unobserved terms are not independent. This means that we need to condition on $\upsilon_1$ to determine the likelihood of $\upsilon_2$.
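The dependence matters because, among the selected observations, the unobserved term of the outcome equation no longer has mean zero. Under the bivariate standard normal assumption, a standard result (the inverse Mills ratio, assumed here rather than derived in the chapter) gives $E[\upsilon_2 | \upsilon_1 \ge -k] = \rho \phi(k)/\Phi(k)$. The sketch below checks this by simulation, where `k` stands in for $\mathbf{Z}_i' \gamma$ and its value is arbitrary.

```r
set.seed(5)
rho <- 0.5
k <- 0.3                      # stands in for Z'gamma
n <- 1e6
u1 <- rnorm(n)
u2 <- rho*u1 + sqrt(1 - rho^2)*rnorm(n)   # corr(u1, u2) = rho
selected <- u1 >= -k
# mean of u2 among selected observations vs. the closed form
c(simulated = mean(u2[selected]),
  formula = rho*dnorm(k)/pnorm(k))
```

With a positive $\rho$ the selected sample has systematically higher unobserved terms, which is the sense in which observed wages can overstate wage offers.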

Unfortunately, it is rather gruesome to write down the likelihood in this way. Therefore, we take advantage of Bayes’s rule and write down the likelihood of $\upsilon_1$ conditional on the value of $\upsilon_2$.⁶

\[ L(y_i, \mathbf{X}_i, \mathbf{Z}_i | y_i > 0, \beta, \gamma, \rho) = \left(1 - \Phi\left(\frac{-\mathbf{Z}_i'\gamma - \rho \upsilon_{2i}}{\sqrt{1 - \rho^2}}\right)\right) \phi(\upsilon_{2i}) \tag{11}\]

where $\upsilon_{2i} = y_i - \mathbf{X}_i' \beta$. Note how the conditioning is done in order to keep using the standard normal distribution.⁷
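A quick sanity check on Equation 11 is that integrating it over all values of $\upsilon_2$ should recover the probit probability of being observed, $\Phi(\mathbf{Z}_i' \gamma)$. The sketch below verifies this numerically with base R's `integrate()`. The values of `rho` and `k` (standing in for $\mathbf{Z}_i' \gamma$) are arbitrary.

```r
rho <- 0.5
k <- 0.3   # stands in for Z'gamma
# Equation 11 as a function of v = upsilon_2, using symmetry of the normal
lik <- function(v) pnorm((k + rho*v)/sqrt(1 - rho^2))*dnorm(v)
# integrating over v should recover P(upsilon_1 >= -k) = pnorm(k)
c(integral = integrate(lik, -Inf, Inf)$value, probit = pnorm(k))
```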

We want to find the parameters ($\beta$, $\gamma$ and $\rho$) that maximize the probability that the model predicts the observed data.

\[ \begin{array}{l} \max_{\beta, \gamma, \rho} \sum_{i=1}^N \mathbb{1}(y_i = 0)\log \left(\Phi \left(-\mathbf{Z}_i' \gamma \right) \right)\\ + \mathbb{1}(y_i > 0) \left(\log \left(\Phi\left(\frac{\mathbf{Z}_i' \gamma + \rho (y_i - \mathbf{X}_i' \beta)}{\sqrt{1 - \rho^2}}\right) \right) \right.\\ + \left. \log\left(\phi(y_i - \mathbf{X}_i' \beta) \right)\right) \end{array} \tag{12}\]

Note that I have taken advantage of the fact that normals are symmetric.

To make this presentation a little less messy, I assumed that the distribution of unobserved terms is a bivariate standard normal. That is, the variance terms are 1. As with the probit, the variance of $\upsilon_1$ is not identified, but as with the Tobit, the variance of $\upsilon_2$ is identified.⁸ Nevertheless, here it is assumed to be 1.
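For reference, if the variance of $\upsilon_2$ were estimated rather than fixed at 1, the contribution of an observation with $y_i > 0$ would plausibly change as follows. This is a sketch that follows the same steps as the Tobit likelihood, with $\sigma$ denoting the standard deviation of $\upsilon_2$. It is not the estimator used in the chapter.

\[ \mathbb{1}(y_i > 0) \left(\log \left(\Phi\left(\frac{\mathbf{Z}_i' \gamma + \rho \frac{y_i - \mathbf{X}_i' \beta}{\sigma}}{\sqrt{1 - \rho^2}}\right) \right) + \log\left(\phi\left(\frac{y_i - \mathbf{X}_i' \beta}{\sigma}\right) \right) - \log(\sigma)\right) \]

Setting $\sigma = 1$ returns the corresponding terms of Equation 12.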

Heckman Estimator in `R`

The code for the Heckman estimator is very similar to the code for the Tobit estimator. The difference is that this estimator allows for a set of characteristics that determine whether or not the outcome variable is censored.

f_heckman <- function(par,y, X_in, Z_in = X_in) {
  # defaults to Z_in = X_in
  X <- cbind(1,X_in)
  Z <- cbind(1,Z_in)
  is0 <- y == 0 # indicator function
  rho <- exp(par[1])/(1 + exp(par[1]))
  # this is the sigmoid function, which restricts rho to (0, 1).
  # Note that in actual fact rho is between -1 and 1.
  beta <- par[2:(1+dim(X)[2])]
  gamma <- par[(2 + dim(X)[2]):length(par)]
  Xb <- X%*%beta
  Zg <- Z%*%gamma
  Zg_adj <- (Zg + rho*(y - Xb))/((1 - rho^2)^(.5))
  log_lik <- is0*log(pnorm(-Zg)) + 
    (1 - is0)*(log(pnorm(Zg_adj)) + 
                 log(dnorm(y - Xb)))
  return(-sum(log_lik))
}
par1 <- c(0,lm1$coefficients,glm1$coefficients)
a2 <- optim(par=par1,fn=f_heckman,y=y,X_in=x,Z_in=z)
# full argument names avoid relying on R's partial matching.
# rho
exp(a2$par[1])/(1 + exp(a2$par[1]))

          
0.5357194

# beta
a2$par[2:3]

(Intercept)          x1 
   5.831804   -2.656908

# gamma
a2$par[4:5]

(Intercept)           z 
   4.406743   -5.488539

The Heckman estimator does a pretty good job of estimating the true parameters. The true $\beta = \{6, -3\}$, while the true $\gamma = \{4, -5\}$ and $\rho = 0.5$.
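The sigmoid transformation used for $\rho$ in the code keeps the estimate between 0 and 1, which rules out negative correlation. One possible alternative, shown here only as a sketch and not as the chapter's method, is the hyperbolic tangent, which maps any real number into the interval from -1 to 1. The vector `par_rho` is an illustrative set of unconstrained values.

```r
par_rho <- c(-3, 0, 3)        # unconstrained values an optimizer might try
plogis(par_rho)               # sigmoid, as in f_heckman: values in (0, 1)
tanh(par_rho)                 # alternative: values in (-1, 1)
```

Swapping the transformation would change the meaning of the first element of the parameter vector, so the reported $\rho$ would need to be computed with `tanh()` as well.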

Analyzing the Gender Wage Gap

We can analyze the difference in wages between men and women using the NLSY97. Here we use the data from 2007 in order to minimize issues due to censoring. The analysis is also limited to “full-time” workers, those working more than an average of 35 hours per week.⁹ A Heckman model is used to adjust for selection.

NLSY97 Data

The data is from NLSY97 with hours and income from 2007.¹⁰ The analysis is limited to individuals working more than 1750 hours per year.

x <- read.csv("NLSY97_gender_book.csv")
x$wage <- 
  ifelse(x$CVC_HOURS_WK_YR_ALL.07_XRND > 0, 
         x$YINC.1700_2007/x$CVC_HOURS_WK_YR_ALL.07_XRND, 0)
x$lwage <- ifelse(x$wage > 1, log(x$wage), 0)
x$fulltime <- x$CVC_HOURS_WK_YR_ALL.07_XRND > 1750
x$lftwage <- ifelse(x$lwage > 0 & x$fulltime, x$lwage, 0)
x$female <- x$KEY_SEX_1997==2
x$black <- x$KEY_RACE_ETHNICITY_1997==1
x$age <- 2007 - x$KEY_BDATE_Y_1997
x$age2 <- x$age^2
x$college <-  x$CV_HIGHEST_DEGREE_0708_2007 >= 3
x$south <- x$CV_CENSUS_REGION_2007==3
x$urban <- x$CV_URBAN.RURAL_2007==1
x$msa <- x$CV_MSA_2007 > 1 & x$CV_MSA_2007 < 5
x$married <- x$CV_MARSTAT_COLLAPSED_2007==2
x$children <- x$CV_BIO_CHILD_HH_2007 > 0
index_na <- is.na(rowSums(cbind(x$black,x$lftwage,x$age,
                                x$msa, x$urban,x$south,
                                x$college,x$female,
                                x$married,x$children)))==0
x1 <- x[index_na,]
x1_f <- x1[x1$female,]
x1_m <- x1[!x1$female,]
# split by gender

Figure 4 presents the densities of log wage rates for male and female full-time workers. It shows that female wages are shifted down, particularly at the low and high end. The question from the analysis above is whether the true difference is much larger. Is the estimated distribution of wages for females biased due to selection? Asked another way, is the distribution of female wages shifted up relative to the distribution of female wage offers?

Figure 4: Density of full-time log wages by gender

Part of the explanation for the difference may be differences in education level, experience or location. We can include these additional variables in the analysis.

The regressions presented in Table 3 suggest that there is a substantial female wage gap. This is shown in two ways, first by comparing Model (1) to Model (2). These are identical regressions except that Model (1) is just on male wage earners and Model (2) is just on female wage earners. The regressions are similar except for the intercept term. Model (2) is substantially shifted down relative to Model (1). The second way is by simply adding a dummy for female in Model (3). The negative coefficient on the dummy also suggests that there is a substantial gender wage gap, even accounting for education and experience.

lm1 <- lm(lftwage ~ age + age2 + black + college +
            south + msa, data=x1_m)
lm2 <- lm(lftwage ~ age + age2 + black + college +
            south + msa, data=x1_f)
lm3 <- lm(lftwage ~ female + age + age2 + black + college +
            south + msa, data=x1)

Table 3

	(1)	(2)	(3)
(Intercept)	-12.470	-5.341	-3.297
	(14.382)	(10.084)	(8.412)
age	0.990	0.430	0.292
	(1.148)	(0.807)	(0.672)
age2	-0.017	-0.008	-0.004
	(0.023)	(0.016)	(0.013)
blackTRUE	-0.660	0.020	-0.255
	(0.083)	(0.057)	(0.048)
collegeTRUE	0.547	0.620	0.596
	(0.133)	(0.082)	(0.072)
southTRUE	0.053	-0.025	-0.012
	(0.080)	(0.055)	(0.047)
msaTRUE	0.030	0.038	0.048
	(0.143)	(0.120)	(0.093)
femaleTRUE			-0.660
			(0.045)
Num.Obs.	1076	1555	2631
R2	0.097	0.042	0.115
R2 Adj.	0.092	0.038	0.113
AIC	3497.8	4563.6	8134.8
BIC	3537.6	4606.4	8187.7
Log.Lik.	-1740.898	-2273.810	-4058.398
RMSE	1.22	1.04	1.13

OLS estimates of log wages for full-time workers in 2007 from NLSY97. Model (1) is for male workers. Model (2) is for female workers. Model (3) includes both genders but a dummy variable for female.

Choosing To Work

While the previous analysis suggests a substantial gender wage gap, that gap may be underestimated. This would occur if the women observed in the work force were the ones more likely to have received higher wage offers. The first step to estimate the distribution of wage offers is to estimate the “choice to work.” I put choice in quotations because I am not assuming that all women are actually making a choice. I am assuming that whether a woman works or not depends in part on what the woman expects to earn.

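# note: no family is specified, so glm() fits a Gaussian
# (linear probability) model rather than a probit.
glm1 <- glm(lftwage > 0 ~ college, data = x1_f)
glm2 <- glm(lftwage > 0 ~ college + south, data = x1_f)
glm3 <- glm(lftwage > 0 ~ college + south + married +
              children, data = x1_f)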

Table 4

	(1)	(2)	(3)
(Intercept)	0.284	0.278	0.157
	(0.012)	(0.016)	(0.062)
collegeTRUE	0.219	0.218	0.215
	(0.036)	(0.036)	(0.036)
southTRUE		0.014	0.015
		(0.023)	(0.023)
marriedTRUE			-0.037
			(0.089)
childrenTRUE			0.126
			(0.062)
Num.Obs.	1555	1555	1555
R2	-0.098	-0.099	-0.097
AIC	1984.9	1986.5	1986.2
BIC	2000.9	2007.9	2018.3
Log.Lik.	-989.444	-989.270	-987.115
RMSE	0.46	0.46	0.46

Linear probability estimates of the “choice” to work for females in NLSY97. Because the `glm()` calls do not specify a family, R fits a Gaussian model rather than a probit.

Table 4 shows that having a college education substantially increases the likelihood that a woman will work, by about 22 percentage points. The analysis also suggests that being married and having children affect the likelihood of working, although the coefficient on married is not statistically significantly different from zero and the coefficient on children is positive.
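Because `glm()` uses a Gaussian family unless told otherwise, it is worth checking which model a call actually fits. The sketch below uses simulated data with made-up variable names (`work`, `college`) to contrast the default fit with an explicit probit.

```r
set.seed(6)
n <- 2000
college <- runif(n) < 0.3
work <- (-0.5 + 0.6*college + rnorm(n)) > 0
# no family: Gaussian, that is, a linear probability model
fit_lpm <- glm(work ~ college)
fit_probit <- glm(work ~ college, family = binomial(link = "probit"))
family(fit_lpm)$family             # "gaussian"
family(fit_probit)$link            # "probit"
rbind(lpm = coef(fit_lpm), probit = coef(fit_probit))
```

The two sets of coefficients are on different scales: the linear probability coefficients are changes in the probability of working, while the probit coefficients are changes in the latent index.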

Heckman Estimates of Gender Gap

In order to use the Heckman model, the outcome variable is standardized so that it has mean zero and variance one, in line with the unit variance assumed for $\upsilon_2$ above. The model assumes that income for men and women is determined by experience, education and location. For men it is assumed that if we don’t observe a man working full-time, it is something idiosyncratic about the man. In contrast, we assume that women are selected to work full-time based on education, location, number of children and whether they are married or not.

x1$lftwage_norm <- 
  (x1$lftwage-mean(x1$lftwage))/sd(x1$lftwage)
x1$lftwage_norm3 <- ifelse(x1$lftwage==0,0,x1$lftwage_norm)
y1 <- x1$lftwage_norm3
X1 <- cbind(x1$age,x1$age2,x1$black,x1$college,
            x1$south,x1$msa)
Z1 <- cbind(x1$college,x1$south,x1$married,x1$children)
y1f <- y1[x1$female]
X1f <- X1[x1$female,]
Z1f <- Z1[x1$female,]
par1 <- c(0,lm2$coefficients,glm3$coefficients)
a2 <- optim(par=par1,fn=f_heckman,y=y1f,X_in=X1f,Z_in=Z1f,
            control = list(trace=0,maxit=10000))
y_adj <- cbind(1,X1f)%*%a2$par[2:8] + rnorm(dim(X1f)[1])
X1m <- X1[!x1$female,]
y_adj_m <- cbind(1,X1m)%*%a2$par[2:8] + rnorm(dim(X1m)[1])

Figure 5 presents density estimates after accounting for selection into full-time work. It shows that the distribution of female wage offers is shifted much further down than the standard estimate. In order to account for observed differences between men and women, the chart presents a density of wage for women but with their observed characteristics set to the same values as men. Accounting for other observed differences between men and women has little effect.

Figure 5: Density of wages by gender. Selection adjusted estimate is a solid line. Adjusted for male characteristics is the dashed line.

Back to School Returns

This section returns to the question of whether an additional year of schooling increases income. It uses the Heckman model, an approach that is similar to the IV approach presented in Chapter 3. It shows that the effect of college is heterogeneous and not always positive.

NLSM Data

We can use the NLSM data used by Card (1995) to compare the Heckman model with the IV approach.¹¹ In this analysis, a person is assumed to have gone to college if they have more than 12 years of education.

x <- read.csv("nls.csv",as.is=TRUE)  
x$lwage76 <- as.numeric(x$lwage76)

Warning: NAs introduced by coercion

x1 <- x[is.na(x$lwage76)==0,] 
x1$lwage76_norm <- 
  (x1$lwage76 - mean(x1$lwage76))/sd(x1$lwage76)
# norm log wages for Heckman model
x1$exp <- x1$age76 - x1$ed76 - 6 # working years after school
x1$exp2 <- (x1$exp^2)/100 
x1$college <- x1$ed76 > 12
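The data preparation above can be illustrated on a few made-up values. The sketch below assumes that missing wages are stored as a non-numeric marker such as `"."`, which would explain the coercion warning. The raw values and the example person are hypothetical.

```r
# Hypothetical raw values; "." is assumed here to mark a missing wage
raw <- c("6.31", ".", "5.95")
lwage <- suppressWarnings(as.numeric(raw))  # "." becomes NA
keep <- !is.na(lwage)                       # rows with an observed wage

# Potential experience for a hypothetical 30-year-old with 16 years of schooling
age <- 30
ed  <- 16
exper   <- age - ed - 6   # 8 years of work after leaving school
exper2  <- exper^2 / 100  # 0.64, rescaled so the coefficient is easier to read
college <- ed > 12        # TRUE: more than 12 years of education
```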

College vs. No College

We can compare the differences in income for people who attended college and those that did not.

lm1c <- lm(lwage76_norm ~ exp + exp2 + black + reg76r +
             smsa76r, data=x1[x1$college,])
lm1nc <- lm(lwage76_norm ~ exp + exp2 + black + reg76r +
              smsa76r, data=x1[!x1$college,])

Table 5: OLS estimates of normalized log wages for males in the NLSM data. Model (1) is for males who attended college. Model (2) is for males who did not attend college.

	(1)	(2)
(Intercept)	-0.785	-1.231
	(0.108)	(0.209)
exp	0.175	0.182
	(0.028)	(0.036)
exp2	-0.603	-0.628
	(0.184)	(0.149)
black	-0.372	-0.565
	(0.067)	(0.051)
reg76r	-0.219	-0.469
	(0.050)	(0.050)
smsa76r	0.379	0.396
	(0.056)	(0.048)
Num.Obs.	1521	1489
R2	0.152	0.248
R2 Adj.	0.150	0.245
AIC	3970.8	3773.8
BIC	4008.1	3811.0
Log.Lik.	-1978.380	-1879.916
RMSE	0.89	0.86

Table 5 presents the results from OLS estimates for two groups: males who attended college and males who did not. Note that this model is a little different from the results presented in earlier chapters because I have normalized log wages for use in the Heckman model. Looking at the intercept term, we see that the distribution of wages for males who attended college is shifted up relative to the distribution of wages for males who did not.
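Because log wages are normalized, the coefficients are measured in standard deviations of log wages rather than in log points. Multiplying a coefficient by the standard deviation of log wages converts it back. The sketch below checks this on simulated data. The variable names and parameter values are hypothetical.

```r
set.seed(2)
n <- 1000
exper <- runif(n, 0, 20)
lwage <- 6 + 0.04 * exper + rnorm(n, sd = 0.4)   # simulated log wages
lwage_norm <- (lwage - mean(lwage)) / sd(lwage)  # same normalization as in the text

b_raw  <- coef(lm(lwage ~ exper))["exper"]       # slope in log points
b_norm <- coef(lm(lwage_norm ~ exper))["exper"]  # slope in standard deviations

# Rescaling the normalized slope by sd(lwage) recovers the slope in log points
b_back <- b_norm * sd(lwage)
all.equal(unname(b_back), unname(b_raw))
```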

Choosing College

Again we don’t mean that a person is literally choosing whether or not to attend college. We mean that there are unobserved characteristics of the individual that are related to both attending college and earning income once graduated.

# Probit Estimate
glm1 <- glm(college ~ nearc4, 
            family = binomial(link = "probit"), data=x1)
glm2 <- glm(college ~ nearc4 + momdad14, 
            family = binomial(link = "probit"), data=x1)
glm3 <- glm(college ~ nearc4 + momdad14 + black + smsa66r, 
         family = binomial(link = "probit"),data=x1)
glm4 <- glm(college ~ nearc4 + momdad14 + black + smsa66r
          + reg662 + reg663 + reg664 + reg665 + reg666 +
            reg667 + reg668 + reg669, 
          family = binomial(link = "probit"), data=x1)

Table 6: Probit estimates of the “choice” to attend college for males in the NLSM data. Model (4) includes regional dummies for 1966.

	(1)	(2)	(3)	(4)
(Intercept)	-0.196	-0.640	-0.442	-0.745
	(0.041)	(0.062)	(0.072)	(0.130)
nearc4	0.307	0.316	0.223	0.237
	(0.049)	(0.050)	(0.056)	(0.058)
momdad14		0.552	0.411	0.420
		(0.058)	(0.060)	(0.061)
black			-0.502	-0.466
			(0.059)	(0.064)
smsa66r			0.139	0.130
			(0.055)	(0.057)
reg662				0.233
				(0.122)
reg663				0.277
				(0.120)
reg664				0.355
				(0.142)
reg665				0.198
				(0.122)
reg666				0.327
				(0.137)
reg667				0.266
				(0.131)
reg668				0.767
				(0.182)
reg669				0.533
				(0.133)
Num.Obs.	3010	3010	3010	3010
AIC	4137.5	4046.4	3965.1	3948.4
BIC	4149.5	4064.4	3995.2	4026.5
Log.Lik.	-2066.737	-2020.206	-1977.571	-1961.176
RMSE	0.50	0.49	0.48	0.48

Table 6 reiterates results we have seen in earlier chapters. Growing up near a 4-year college is associated with a higher probability of attending college; so is growing up with both parents. College attendance is also related to race and to where the person was living in 1966. Men living in cities were more likely to attend college.
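Probit coefficients are not changes in probability, so it can help to convert them. The sketch below uses the two coefficients from model (1) of Table 6 and the standard normal distribution function `pnorm`. It is a back-of-the-envelope calculation for that simplest model only.

```r
# Coefficients copied from model (1) of Table 6
b0 <- -0.196  # intercept
b1 <-  0.307  # nearc4

p_far  <- pnorm(b0)       # predicted probability without a nearby 4-year college
p_near <- pnorm(b0 + b1)  # predicted probability with a nearby 4-year college
effect <- p_near - p_far  # difference in predicted probabilities
round(c(p_far = p_far, p_near = p_near, effect = effect), 3)
```

In this model, growing up near a 4-year college is associated with a predicted probability of attending college that is roughly 12 percentage points higher.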

Heckman Estimates of Returns to Schooling

The Heckman estimator is very similar to the IV estimator presented in Chapter 3.¹² I have highlighted the similarity by using the $Z$ notation. In both cases, we have a policy variable $X$ (college attendance) that is determined by instrument-like variables $Z$ (distance to college, location in 1966, and whether they had both parents at home at 14). Formally, there is no requirement about the exogeneity of the $Z$ variables. That is, there may be unobserved characteristics that determine both $Z$ and $X$. The reason is that the parameters can be determined given the parametric assumptions we have made. That said, many researchers prefer to rely on exogeneity rather than on the parametric assumptions. In some sense, this model is more general than the IV model. It allows for multiple instruments, which the IV estimator also allows, but only when it is estimated with GMM as shown in Chapter 8.

A probit is used to determine whether the person attends college or not.¹³ The probit uses some of the instrumental variables discussed in Chapter 3 such as distance to college and whether the person lived with their parents at 14. Notice that these variables are not included in the variables determining income. Also note that the college function does not include variables that will affect income in 1976. The assumption is that the college decision had to do with factors that were true for the person in 1966, while income in 1976 has to do with factors that are true for the person in 1976.
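The role of a variable that enters the choice equation but not the income equation can be illustrated with simulated data. The sketch below uses the two-step (control function) version of the Heckman estimator, not the maximum likelihood routine `f_heckman` used in this chapter. All names and parameter values are hypothetical.

```r
set.seed(3)
n <- 5000
x <- rnorm(n)   # affects both the outcome and selection
z <- rnorm(n)   # excluded variable: affects selection only
rho <- 0.7      # correlation between the two unobserved terms
u <- rnorm(n)                              # selection error
e <- rho * u + sqrt(1 - rho^2) * rnorm(n)  # outcome error, correlated with u

d <- as.numeric(0.5 * x + 1.0 * z + u > 0)  # "choice" indicator
y <- 1 + 2 * x + e                          # outcome, used only when d == 1

# Step 1: probit for the choice, then the inverse Mills ratio
probit <- glm(d ~ x + z, family = binomial(link = "probit"))
index  <- predict(probit, type = "link")
imr    <- dnorm(index) / pnorm(index)

# Step 2: OLS on the selected sample, without and with the correction term
sel <- d == 1
b_naive <- coef(lm(y[sel] ~ x[sel]))
b_heck  <- coef(lm(y[sel] ~ x[sel] + imr[sel]))
rbind(naive = c(b_naive, NA), heckman = b_heck)
```

In this simulation the uncorrected regression overstates the intercept, while the corrected regression should come close to the true values of 1 and 2. The coefficient on the inverse Mills ratio estimates the correlation term because both errors have unit variance here.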

X1 <- cbind(x1$exp,x1$exp2,x1$black,x1$reg76r,x1$smsa76r)
Z1 <- cbind(x1$nearc4,x1$momdad14,x1$black,x1$smsa66r,
            x1$reg662,x1$reg663,x1$reg664,x1$reg665,
            x1$reg666,x1$reg667,x1$reg668,x1$reg669)

# College
y1c <- x1[x1$college,]$lwage76_norm
X1c <- X1[x1$college,]
Z1c <- Z1[x1$college,]

par1 <- c(0,lm1c$coefficients,glm4$coefficients)
a_coll <- optim(par=par1,fn=f_heckman,y=y1c,X=X1c,Z=Z1c,
            control = list(trace=0,maxit=100000))

# No college
y1nc <- -x1[!x1$college,]$lwage76_norm 
# negative in order to account for correlation in income 
# and no college.
X1nc <- X1[!x1$college,]
Z1nc <- Z1[!x1$college,]

par1 <- c(0,lm1nc$coefficients,-glm4$coefficients)
a_nocoll <- optim(par=par1,fn=f_heckman,y=y1nc,X=X1nc,Z=Z1nc,
            control = list(trace=0,maxit=100000))

# Predicted wages
library(mvtnorm) # provides rmvnorm
mu1 <- c(0,0)
rho1 <- exp(a_coll$par[1])/(1 + exp(a_coll$par[1]))
sig1 <- cbind(c(1,rho1),c(rho1,1))
u1 <- rmvnorm(dim(x1)[1],mu1,sig1)
# elements 2 to k+1 of par hold the k wage coefficients
k <- length(lm1c$coefficients)
beta_c <- a_coll$par[2:(k+1)]
beta_nc <- a_nocoll$par[2:(k+1)]
x1$coll_y <- cbind(1,X1)%*%beta_c + u1[,1]
x1$nocoll_y <- -cbind(1,X1)%*%beta_nc + u1[,2]

This is not a “full” Heckman selection model. I have estimated two separate models, one on choosing college and the other on not choosing college. Note that I have used negatives to account for the “not” decision. This is done in order to simplify the exposition. But there are some costs, including the fact that there are two different estimates of the correlation term. In the results presented in Figure 6, the correlation coefficient from the first model is used. Can you write down and estimate a full model?
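The correlation term is the first element of each parameter vector, passed through a logistic transformation before it is used. The sketch below shows that mapping and then draws correlated errors with base R only. The two parameter values are hypothetical stand-ins for the first elements of the two fitted vectors.

```r
# Map an unconstrained parameter to a correlation, as in the code above
to_rho <- function(p) exp(p) / (1 + exp(p))  # logistic; same as plogis(p)

# Hypothetical first elements of the two estimated parameter vectors
p_coll   <- 0.4
p_nocoll <- -0.2
rho_coll   <- to_rho(p_coll)
rho_nocoll <- to_rho(p_nocoll)

# Draw correlated standard normal errors using the first correlation
set.seed(4)
n <- 10000
sig <- cbind(c(1, rho_coll), c(rho_coll, 1))
u <- matrix(rnorm(2 * n), n, 2) %*% chol(sig)  # rows have covariance sig
cor(u[, 1], u[, 2])
```

Two things follow from the transformation. A starting value of 0 corresponds to a correlation of 0.5, and the logistic function only returns values between 0 and 1, so negative correlations are ruled out. The two models will generally give two different values, which is one of the costs mentioned above.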

Effect of College

Figure 6: Plot of predicted potential wages by college and no-college. The 45 degree line is presented.

The Heckman model suggests that the effect of college is heterogeneous. While most benefit from college, not all benefit. Figure 6 presents a plot of the predicted wage for each person in the sample, for both the case where they went to college and the case where they did not go to college. For those sitting above the 45 degree line, their predicted wage is higher going to college than not going to college. For those below the 45 degree line, it is the opposite: their predicted wages are higher if they do not attend college. Given all the assumptions, it is not clear exactly what we should take from this, but it is interesting that so much weight is above the 45 degree line. Remember that only about 50% of the men in the sample actually went to college. The model assumes that people are “choosing” not to go to college based on beliefs about what they would earn if they did. The model is estimated with two Heckman regressions rather than a full “switching” model, which would allow for a richer error structure.
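The share of people above the 45 degree line can be computed directly from the two predicted wages. The sketch below uses simulated potential wages rather than the chapter's estimates. The vectors `coll_y` and `nocoll_y` are hypothetical and only mimic the structure of the predicted values.

```r
set.seed(5)
n <- 3000
# Hypothetical predicted potential wages in normalized units
nocoll_y <- rnorm(n)
coll_y   <- nocoll_y + rnorm(n, mean = 0.5, sd = 0.8)  # heterogeneous gain

gain <- coll_y - nocoll_y
share_gain <- mean(gain > 0)  # share of points above the 45 degree line
avg_gain   <- mean(gain)      # average gain across the whole sample
c(share_gain = share_gain, avg_gain = avg_gain)
```

A positive average gain is compatible with a sizable share of people whose predicted wage is higher without college, which is the sense in which the effect is heterogeneous.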

We can compare this approach to the estimates of the ATE and LATE using OLS and IV. The estimated average treatment effect per year of college is about 0.13, which is substantially higher than the OLS estimate of 0.075. It is similar to the IV and LATE estimates. This should not be too surprising given how closely related the Heckman model is to the IV model.

mean(x1$coll_y - x1$nocoll_y)/4

[1] 0.1276973

Discussion and Further Reading

The chapter introduces the Tobit model, which can be used to estimate models when the data are censored. It also introduces the Heckman selection model, which is used to estimate the gender-wage gap and returns to schooling. When using limited dependent variable models, including the ones discussed in this chapter and the previous chapter, I often go back to the appropriate chapters of Greene (2000).

This chapter showed how the economic assumption of revealed preference can be used for a broader range of problems than just demand estimation. In particular, in labor economics we often have data in which we observe both a decision and the outcome of the decision. The Roy model considers the situation where an individual is choosing between two employment sectors. We observe the choice and the wages in the chosen sector. Heckman and Honore (1990) show that with enough exogenous variation in wages and revealed preference, the joint distribution of wages across both sectors is identified. Recent work considers this model under even weaker assumptions (Henry, Mourifie, and Meango 2020).
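The selection problem in the Roy model can be seen in a short simulation. The sketch below assumes two sectors with normally distributed log wage offers and workers who pick the higher offer. All parameter values are hypothetical.

```r
set.seed(6)
n <- 10000
# Hypothetical log wage offers in two sectors, positively correlated
common <- rnorm(n)
w1 <- 1.0 + 0.5 * common + rnorm(n)
w2 <- 1.2 + 0.5 * common + rnorm(n)

choose1 <- w1 > w2           # revealed preference: take the higher offer
obs_w1 <- mean(w1[choose1])  # mean wage observed in sector 1
pop_w1 <- mean(w1)           # mean sector 1 offer across everyone
c(observed = obs_w1, population = pop_w1)
```

The observed mean wage in sector 1 is higher than the mean offer across the whole population, because only those with relatively good sector 1 offers choose that sector. This is the gap that the identification results aim to account for.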

The chapter re-analyzes the returns to schooling data used in Card (1995). It uses a version of the Heckman model to find average returns similar to the IV estimates. The model also estimates the joint distribution of returns and shows that for some individuals the expected income from attending college may be less than the expected income from not attending college.

References

Card, David. 1995. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” In Aspects of Labour Market Behaviour: Essays in Honour of John Vanderkamp, edited by Louis N. Christofides, E. Kenneth Grant, and Robert Swidinsky, 201–22. University of Toronto Press.

Enami, Kohei, and John Mullahy. 2009. “Tobit at fifty: a brief history of Tobin’s remarkable estimator, of related empirical methods, and of limited dependent variable econometrics in health economics.” Health Economics 18 (6): 619–28.

Goldberger, Arthur. 1991. A Course in Econometrics. Harvard University Press.

Greene, William. 2000. Econometric Analysis. 4th ed. Prentice Hall.

Heckman, James, and Bo Honore. 1990. “The Empirical Content of the Roy Model.” Econometrica 58 (5): 1121–49.

Henry, Marc, Ismael Mourifie, and Romuald Meango. 2020. “Sharp bounds and testability of a Roy model of STEM major choices.” Journal of Political Economy.

Footnotes

1. We will use the term “choice” to refer to assignment to treatment that is associated with some sort of economic decision. It does not mean that the observed individual actually had a “choice.”
2. See Appendix A for a longer discussion of what it means for an estimator to be unbiased.
3. According to former Tobin students, the estimator once took an army of graduate research assistants armed with calculators to estimate the model.
4. There are various exemptions such as for employees receiving tips.
5. This data and other similar data sets are available from the Bureau of Labor Statistics here: https://www.nlsinfo.org/investigator/pages/login.jsp. This version can be downloaded from here: https://sites.google.com/view/microeconometricswithr/table-of-contents
6. When dealing with joint distributions it is useful to remember the relationship $\Pr(A, B) = \Pr(A | B) \Pr(B) = \Pr(B | A) \Pr(A)$ where $A$ and $B$ represent events associated with each of two random variables. The term on the left-hand side represents the joint distribution. The middle term is the probability of observing $A$ conditional on the value of $B$, multiplied by the probability of observing $B$. The term on the far right is the other way around.
7. I can never remember exactly how to do this, so I always keep a copy of Goldberger (1991) close by. In R you can sometimes get around remembering all this and use a package like mvtnorm to account for multivariate normal distributions.
8. See discussion of the bivariate probit in the previous chapter.
9. Assuming a 50 week work-year.
10. Again, this data set is available from the Bureau of Labor Statistics here: https://www.nlsinfo.org/investigator/pages/login.jsp. This version can be downloaded from here: https://sites.google.com/view/microeconometricswithr/table-of-contents
11. See a discussion of this data in Chapters 1 and 2.
12. I have simplified things somewhat in order to use the Heckman model presented above.
13. This is often called a control function, and the approach is called a control function approach.