Ordinary Least Squares

Author

Christopher P. Adams

Introduction

Ordinary least squares (OLS) is the work-horse model of microeconometrics. It is quite simple to estimate. It is straightforward to understand. It presents reasonable results in a wide variety of circumstances.

The chapter uses OLS to disentangle the effect of the policy variable from other unobserved characteristics that may be affecting the outcome of interest. The chapter describes how the model works and presents the two standard algorithms, one based on matrix algebra and the other based on least squares. It uses OLS on actual data to estimate the value of an additional year of high school or college.

Estimating the Causal Effect

You have been asked to evaluate a policy which would make public colleges free in the state of Vermont. The Green Mountain state is considering offering free college education to residents of the state. Your task is to determine whether more Vermonters will attend college and whether those additional attendees will be better off. Your boss narrows things down to the following question. Does an additional year of schooling cause people to earn more money?

The section uses simulated data to illustrate how averaging is used to disentangle the effect of the policy variable from the effect of the unobserved characteristics.

Graphing the Causal Effect

Your problem can be represented by the causal graph in Figure 1. The graph shows an arrow from \(X\) to \(Y\) and an arrow from \(U\) to \(Y\). In the graph, \(X\) represents the policy variable. This is the variable that we are interested in changing. In the example, it represents years of schooling. The outcome of interest is represented by \(Y\). Here, the outcome of interest is income. The arrow from \(X\) to \(Y\) represents the causal effect of \(X\) on \(Y\). This means that a change in \(X\) will lead to a change in \(Y\). The size of the causal effect is represented by \(b\). The model states that if a policy encourages a person to gain an additional year of schooling (\(X\)), then the person’s income (\(Y\)) increases by the amount \(b\). Your task is to estimate \(b\).

If the available data is represented by the causal graph in Figure 1, can it be used to estimate \(b\)? The graph shows there is an unobserved term \(U\) that is also affecting \(Y\). \(U\) may represent unobserved characteristics of the individual or the place where they live. In the example, some people live in Burlington while others live in Northeast Kingdom. Can we disentangle the effect of \(X\) on \(Y\) and the effect of \(U\) on \(Y\)?

A Linear Causal Model

Consider a simple model of our problem. Individual \(i\) earns income \(y_i\) determined by their education level \(x_i\) and unobserved characteristics \(\upsilon_i\). We assume that the relationship is a linear function.

\[ y_i = a + b x_i + \upsilon_i \tag{1}\]

where \(a\) and \(b\) are the parameters that determine how much income individual \(i\) earns and how much of that is determined by their level of education.

Our goal is to estimate these parameters from the data we have.

Simulation of the Causal Effect

In the simulated data, we have a true linear relationship between \(x\) and \(y\) with an intercept of 2 and a slope of 3. The unobserved characteristic is distributed **standard normal} (\(\upsilon_i \sim \mathcal{N}(0, 1)\)). This means that the unobserved characteristic is drawn from a normal distribution with mean 0 and standard deviation of 1. We want to estimate the value of \(b\), which has a true value of \(3\).

# Create a simulated data set set.seed(123456789)
# use to get the exact same answer each time the code is run. 
N <- 100
# Set N to 100, to represent the number of observations.\
a <- 2 
b <- 3 
# model parameters of interest \# Note the use of \<- to mean "assign". 
x <- runif(N)
# create a vector where the observed characteristic, x,\
# is drawn from a uniform distribution. 
u <- rnorm(N)
# create a vector where the unobserved characteristic, \# u is drawn from a standard normal distribution. 
y <- a + b*x + u 
# create a vector y \#* allows a single number to be multiplied through \# the whole vector \# + allows a single number to be added to the whole vector \# or for two vectors of the same length to be added together.

A computer does not actually generate random numbers. It generates pseudo-random numbers. These numbers are derived from a distinct function. If you know the function and the current number, then you know exactly what the next number will be. The book takes advantage of this process by using the R function, set.seed() to generate exactly the same numbers every time.

plot(x,y)  # creates a simple plot
abline(a = 2,b = 3)  # adds a linear function to the plot.
# a - intercept, b - slope.

Figure 2 presents the true relationship between \(X\) and \(Y\) as the solid line sloping up. The observed relationship is represented as the circles randomly spread out over the box. The plot suggests that we can take averages in order to determine the true relationship between \(X\) and \(Y\).

Averaging to Estimate the Causal Effect}

Figure 2 suggests that we can use averaging to disentangle the effect of \(X\) on \(Y\) from the effect of \(U\) on \(Y\). Although the circles are spread out in a cloud, they follow a distinct upward slope. Moreover, the line that sits in the center of the cloud represents the true relationship. If we take values of \(X\) close to 0, then the average of \(Y\) will be near 2. If we look at values of \(X\) close to 1, the average of \(Y\) is near 5. That is, if we change the value of \(X\) by 1, the average value of \(Y\) increases by 3.

mean(y[x > 0.95]) - mean(y[x < 0.05])

[1] 3.32937

# mean takes an average \# the logical expression inside the square brackets # creates an index for the elements of y where the logical \# expression in x holds.

We can do this with the simulated data. In both cases, we are “close” but not actually equal to 0 or 1. The result is that we find a slope that is equal to 2.72, not 3.

We disentangle the effect of \(X\) and \(U\) by averaging out the unobserved characteristic. By taking the difference in the average of \(Y\) calculated at two different values of \(X\), we can determine how \(X\) affects the average value of \(Y\). In essence, this is what OLS does.

Assumptions of the OLS Model

The major assumptions of the OLS model are that unobserved characteristics enter independently and additively. The first assumption states that conditional on observed characteristics (the \(X\)’s), the unobserved characteristic (the \(U\)) has independent effects on the outcome of interest (\(Y\)). In Figure 1}, this assumption is represented by the fact there is no arrow from \(U\) to \(X\). In the econometrics literature, the assumption is given the awkward appellation, unconfoundedness (Imbens 2010).

To understand the independence assumption, it is helpful to go back to the original problem. We are interested in determining the economic value of attending college. Our estimated model of the effect of schooling on income allows the unobserved characteristics to determine income. Importantly, the model does not allow unobserved characteristics to affect both schooling and income. The model does not allow students from wealthy families to be more likely to go to college and get a good job due to their family background.

The second assumption states that unobserved characteristics enter the model additively.¹ The implication is that the effect of the policy cannot vary with unobserved characteristics. Attending college increases everyone’s income by the same amount. This assumption allows the effect of \(U\) and \(X\) to be disentangled using averaging. This assumption also allows the use of the quick and robust least squares algorithm.

The simulated data set satisfies these two assumptions. The unobserved characteristic, \(u\), is drawn independently of \(x\) and it affects \(y\) additively.

Matrix Algebra of the OLS Model

The chapter presents two algorithms for solving the model and revealing the parameter estimates; the algebraic algorithm and the least squares algorithm.² This section uses matrix algebra to derive and program up the OLS estimator in R.³

Standard Algebra of the OLS Model

Let’s simplify the problem. Consider (Equation 1) and let \(a = 2\), so we only need to determine the parameter \(b\).

\[ b = \frac{y_i - 2 - \upsilon_i}{x_i} \tag{2}\]

The parameter of interest (\(b\)) is a function of both observed terms (\(\{y_i,x_i\}\)) and the unobserved term (\(\upsilon_i\)).

Rearranging (Equation 2}) highlights two problems. The first problem is that the observed terms (\(\{y_i,x_i\}\)) are different for each person \(i\), but (Equation 2) states that \(b\) is exactly the same for each person. The second problem is that the unobserved term (\(\upsilon_i\)) is unobserved. We do not know what it is.

Luckily, we can “kill two birds with one stone.” As (Equation 2) must hold for each individual \(i\) in the data, we can determine \(b\) by averaging. Going back to the original equation ( (Equation 1)), we can take the average of both sides of the equation.

\[ \frac{1}{N}\sum_{i=1}^N y_i = \frac{1}{N}\sum_{i=1}^N (2 + b x_i + \upsilon_i) \tag{3}\]

We use the notation \(\sum_{i=1}^N\) to represent summing over \(N\) individuals in data. The Greek letter \(\sum\) is capital sigma.⁴

Summing through on the right-hand side, we have the slightly simplified equation below.

\[ \frac{1}{N}\sum_{i=1}^N y_i = 2 + b \frac{1}{N}\sum_{i=1}^N x_i + \frac{1}{N}\sum_{i=1}^N \upsilon_i\\ \\ or\\ \bar{y} = 2 + b \bar{x} + \bar{\upsilon} \tag{4}\]

where \(\bar{y}\) denotes the sample average.

Dividing by \(\bar{x}\) and rearranging to make the parameter of interest (\(b\)) as a function of the observed and unobserved terms:

\[ b = \frac{\bar{y} - 2 - \bar{\upsilon}}{\bar{x}} \tag{5}\]

Unfortunately, we cannot actually solve this equation and determine \(b\). The problem is that we still cannot observe the unobserved terms, the \(\upsilon_i\)’s.

However, we can estimate \(b\) using (Equation 5). The standard notation is \(\hat{b}\).

\[ \hat{b} = \frac{\bar{y} - 2}{\bar{x}} \tag{6}\]

The estimate of \(b\) is a function of things we observe in the data. The good news is that we can calculate \(\hat{b}\). The bad news is that we don’t know if it is equal to \(b\).

How close is our estimate to the true value of interest? How close is \(\hat{b}\) to \(b\)? If we have a lot of data and it is reasonable to assume the mean of the unobserved terms is zero, then our estimate will be very close to the true value. If we assume that \(E(u_i) = 0\) then by the Law of Large Numbers if \(N\) is large, \(\frac{1}{N}\sum_{i=1}^N u_i = 0\) and our estimate is equal to the true value (\(\hat{b} = b\)).⁵

Algebraic OLS Estimator in `R`

We can use the algebra presented above as pseudo-code for our first estimator in R.

b_hat <- (mean(y) - 2)/mean(x) 
b_hat

[1] 3.018012

Why does this method not give a value closer to the true value? The method gives an estimate of 3.46, but the true value is 3. One reason may be that the sample size is not very large. You can test this by running the simulation and increasing \(N\) to 1,000 or 10,000. Another reason may be that we have simplified things to force the intercept to be 2. Any variation due to the unobserved term will be taken up by the slope parameter giving an inaccurate or biased estimate.⁶

Using Matrices

In general, of course, we do not know \(a\) and so need to jointly solve for both \(a\) and \(b\). We use matrix algebra to solve this more complicated problem.

shows the equations representing the data. It represents a system of 100 linear equations.

\[ \left [\begin{array}{c} y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ \vdots \end{array} \right] = \left [\begin{array}{c} 2 + 3x_1 + u_1\\ 2 + 3x_2 + u_2\\ 2 + 3x_3 + u_3\\ 2 + 3x_4 + u_4\\ 2 + 3x_5 + u_5\\ \vdots \end{array} \right] \tag{7}\]

Using matrix algebra, we can rewrite the system of equations.

\[ \left [\begin{array}{c} y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ \vdots \end{array} \right] = \left [\begin{array}{cc} 1 & x_1\\ 1 & x_2\\ 1 & x_3\\ 1 & x_4\\ 1 & x_5\\ \vdots \end{array} \right] \left [\begin{array}{c} 2\\ 3 \end{array} \right] + \left [\begin{array}{c} u_1\\ u_2\\ u_3\\ u_4\\ u_5\\ \vdots \end{array} \right] \tag{8}\]

Notice how matrix multiplication is done. Standard matrix multiplication follows the rule below.

\[ \left[\begin{array}{cc} a & b\\ c & d \end{array}\right] \left[\begin{array}{cc} e & f\\ g & h \end{array}\right] = \left[\begin{array}{cc} ae + bg & af + bh\\ ce + dg & cf + dh \end{array}\right] \tag{9}\]

Check what happens if you rearrange the order of the matrices. Do you get the same answer?⁷

Multiplying Matrices in `R`

We now illustrate matrix multiplication with R.

x1 = x[1:5] 
# only include the first 5 elements 
X1 = cbind(1,x1) 
# create a matrix with a columns of 1s \# cbind means column bind - 
# it joins columns of the same length together. 
# It returns a "matrix-like" object. 
# Predict value of y using the model 
X1%*%c(2,3)

         [,1]
[1,] 3.299745
[2,] 4.007290
[3,] 3.270136
[4,] 2.743498
[5,] 2.101362

# See how we can add and multiply vectors and numbers in R. \# In R %*% represents standard matrix multiplication. \# Note that R automatically assumes c(2,3) is a column vector \# Compare to the true values 
y[1:5]

[1] 2.923367 4.478406 2.778367 4.852436 1.098138

In the simulated data we see the relationship between \(y\) and \(x\). Why aren’t the predicted values equal to the true values?⁸

Matrix Estimator of OLS

It is a lot more compact to represent the system in (1) with matrix notation.

\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{u} \tag{10}\]

where \(\mathbf{y}\) is a \(100 \times 1\) column vector of the outcome of interest \(y_i\), \(\mathbf{X}\) is a \(100 \times 2\) rectangular matrix of the observed explanatory variables \(\{1, x_i\}\), \(\beta\) is a \(2 \times 1\) column vector of the model parameters \(\{a, b\}\) and \(\mathbf{u}\) is a \(100 \times 1\) column vector of the unobserved term \(u_i\).

To solve the system we can use the same “division” idea that we used for standard algebra. In matrix algebra, we do division by multiplying the inverse of the matrix (\(\mathbf{X}\)) by both sides of the equation. Actually, this is how we do division in normal algebra, it is just that we gloss over it and say we “moved” it from one side of the equation to the other.

The problem is that our matrix is not invertible. Only full-rank square matrices are invertible and our matrix is not even square.⁹ The solution is to create a generalized matrix inverse.

We can make our matrix (\(\mathbf{X}\)) square by pre-multiplying it by its transpose.¹⁰ The notation for this is \(\mathbf{X}'\). Also remember that in matrix multiplication order matters! Pre-multiplying means placing the transposed matrix to the left of the original matrix.

\[ \mathbf{X}' \mathbf{y} = \mathbf{X}' \mathbf{X} \beta + \mathbf{X}' \mathbf{u} \tag{11}\]

The matrix \(\mathbf{X}' \mathbf{X}\) is a \(2 \times 2\) matrix as it is a \(2 \times 100\) matrix multiplied by a \(100 \times 2\) matrix.¹¹

To solve for the parameters of the model (\(\beta\)), we pre-multiply both side of (Equation 10) by the matrix we created. Thus we have the following equation.

\[ (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \mathbf{X} \beta + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{u} \tag{12}\]

Simplifying and rearranging we have the following linear algebra derivation of the model.

\[ \beta = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \mathbf{y} - (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \mathbf{u} \tag{13}\]

From this we have the matrix algebra based estimate of our model.

\[ \hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}' \mathbf{y} \tag{14}\]

Remember that we never observe \(u\) and so we do not know the second part of (Equation 13).¹²

We didn’t just drop the unobserved term, we averaged over it. If the assumptions of OLS hold, then summation \((\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{u}\) will generally be close to zero. This issue is discussed more in Chapter 3 and Appendix A.

Matrix Estimator of OLS in `R`

We can follow a similar procedure for inverting a matrix using R. The matrix of explanatory variables includes a first column of 1’s which accounts for the intercept term.

X <- cbind(1,x) 
# remember the column of 1's

Next we need to transpose the matrix. To illustrate a matrix transpose consider a matrix \(\mathbf{A}\) which is \(3 \times 2\) (3 rows and 2 columns) and has values 1 to 6.

A <- matrix(c(1:6),nrow=3) 
# creates a 3 x 2 matrix. 
A

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

# See how R numbers elements of the matrix. 
t(A)

     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

# transpose of matrix A 
t(A)%*%A

     [,1] [,2]
[1,]   14   32
[2,]   32   77

# matrix multiplication of the transpose by itself

In our problem the transpose multiplied by itself gives the following \(2 \times 2\) matrix.

t(X)%*%X

                   x
  100.00000 51.25536
x  51.25536 34.87503

In R, the matrix inverse can be found using the solve() function.

solve(t(X)%*%X)

                        x
   0.04053393 -0.05957218
x -0.05957218  0.11622625

# solves for the matrix inverse.

The matrix algebra presented (Equation 14}) is pseudo-code for the operation in R.

beta_hat <- solve(t(X)%*%X)%*%t(X)%*%y 
beta_hat

      [,1]
  1.803391
x 3.401600

Our estimates are \(\hat{a} = 2.18\) and \(\hat{b} = 3.09\). These values are not equal to the true values of \(a = 2\) and \(b = 3\), but they are fairly close. Try running the simulation again, but changing \(N\) to 1,000. Are the new estimates closer to the their true values? Why?¹³

We can check that we averaged over the unobserved term to get something close to 0.

solve(t(X)%*%X)%*%t(X)%*%u

        [,1]
  -0.1966095
x  0.4016003

Least Squares Method for OLS

An alternative algorithm is least squares. This section presents the algebra of least squares and compares the linear model (lm()) estimator to the matrix algebra version derived above.

Moment Estimation

Least squares requires that we assume the unobserved characteristic has a mean of 0. We say that the first moment of the unobserved characteristic is 0. A moment refers to the expectation of a random variable taken to some power. The first moment is the expectation of the variable taken to the power of 1. The second moment is the expectation of the variable taken to the power of 2, etc.

\[ E(u_i) = 0 \tag{15}\]

From (Equation 1) we can rearrange and substitute into (Equation 15).

\[ E(y_i - a - b x_i) = 0 \tag{16}\]

states that the expected difference, or mean difference, between \(y_i\) and the predicted value of \(y_i\) (\(a + bx_i\)) is 0.

The sample analog of the left-hand side of (2}) is the following average.

\[ \frac{1}{N} \sum_{i=1}^{N} (y_i - a - b x_i) \tag{17}\]

The sample equivalent of the mean is the average. This approach to estimation is called analog estimation (Manski 1988). We can make this number as close to zero as possible by minimizing the sum of squares, that is, finding the “least squares.”

Algebra of Least Squares

Again it helps to illustrate the algorithm by solving the simpler problem. Consider a version of the problem in which we are just trying to estimate the \(b\) parameter.

We want to find the \(b\) such that (2) holds. Our sample analog is to find the \(\hat{b}\) that makes (Equation 17) as close to zero as possible.

\[ \min_{\hat{b}} \frac{1}{N}\sum_{i=1}^N (y_i - 2 - \hat{b} x_i)^2 \tag{18}\]

We can solve for \(\hat{b}\) by finding the first order condition of the problem.¹⁴

\[ \frac{1}{N} \sum_{i=1}^N -2x_i(y_i - 2 - \hat{b}x_i) = 0 \tag{19}\]

Note that we can divide both sides by -2, giving the following rearrangement.

\[ \hat{b} = \frac{\frac{1}{N} \sum_{i=1}^N x_i y_i - 2 \frac{1}{N} \sum_{i=1}^N x_i}{\frac{1}{N} \sum_{i=1}^N x_i x_i} \tag{20}\]

Our estimate of the relationship \(b\) is equal to a measure of the covariance between \(X\) and \(Y\) divided by the variance of \(X\). This is the sense that we can think of our OLS estimate as a measure of the correlation between the independent variable (the \(X\)’s) and the outcome variable (the \(Y\)).

Estimating Least Squares in `R`

We can program the least squares estimator in two ways. First, we can solve the problem presented in (Equation 18). Second, we can use the solution to the first order condition presented in (Equation 20).

optimize(function(b) sum((y - 2 - b*x)^2), c(-10,10))$minimum

[1] 3.112646

# optimize() is used when there is one variable. 
# note that the function can be defined on the fly 
# $minimum presents one of the outcomes from optimize()

We can use optimize() to find a single optimal value of a function. We can use function() to create the sum of squared difference function, which is the first argument of optimize(). The procedure searches over the interval which is the second argument. Here it searches over the real line from -10 to 10 (c(-10,10)).¹⁵ It looks for the value that minimizes the sum of the squared differences. The result is 3.37. Why do you think this is so far from the true value of 3?¹⁶

Alternatively, we can use the first order condition.

(mean(x*y) - 2*mean(x))/mean(x*x)

[1] 3.112646

Solving out (Equation 20) in R gives an estimate of 3.37. Why do these two approaches give identical answers?

The `lm()` Function

The standard method for estimating OLS in R is to use the lm() function. This function creates an object in R. This object keeps track of various useful things such as the vector of parameter estimates.

data1 <- as.data.frame(cbind(y,x)) 
# creates a data.frame() object which will be used in the 
# next section. 
lm1 <- lm(y ~ x) 
# lm creates a linear model object 
length(lm1) # reports the number of elements of the list object

[1] 12

names(lm1) # reports the names of the elements

 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "xlevels"       "call"          "terms"         "model"

We can compare the answers from the two algorithms, the lm() procedure and the matrix algebra approach.

lm1$coefficients

(Intercept)           x 
   1.803391    3.401600

# reports the coefficient estimates 
# $ can be used to call a particular element from the list. 
# lm1[1] reports the same thing. 
t(beta_hat) # results from the matrix algebra.

                   x
[1,] 1.803391 3.4016

Comparing the two different approaches in R, we can see they give identical estimates. This is not coincidental. The solve() function is based on a least squares procedure. It turns out there is no real computational difference between an “algebra” approach and a “least squares” approach in R.

Measuring Uncertainty

How do we give readers a sense of how accurate our estimate is? The previous section points out that the estimated parameter is not equal to the true value and may vary quite a bit from the true value. Of course, in the simulations we know exactly what the true value is. In real world econometric problems, we do not.

This section provides a brief introduction to some of the issues that arise when thinking about reporting the uncertainty around our estimate. Appendix A discusses these issues in more detail.

Data Simulations

Standard statistical theory assumes that the data we observe is one of many possible samples. Luckily, with a computer we can actually see what happens if we observe many possible samples. The simulation is run 1,000 times. In each case a sample of 100 is drawn using the same parameters as above. In each case, OLS is used to estimate the two parameters \(\hat{a}\) and \(\hat{b}\).

set.seed(123456789) 
K <- 1000 
a <- 2 
b <- 3 
sim_res <- matrix(NA,K,2) 
# creates a 1000 x 2 matrix filled with NAs. 
# NA denotes a missing number in R. 
# I prefer using NA rather than an actual number like 0. 
# It is better to create the object to be filled 
# prior to running the loop.
# This means that R has the object in memory. 
# This simple step makes loops in R run a lot faster. 
for (k in 1:K) { # the "for loop" starts at 1 and moves to K
  x <- runif(N) 
  u <- rnorm(N, mean = 0, sd = 2)
  y <- a + b*x + u 
  sim_res[k,] <- lm(y ~ x)$coefficients 
  # inputs the coefficients from simulated data into the 
  # kth row of the matrix. 
  # print(k) 
  # remove the hash to keep track of the loop 
  } 
colnames(sim_res) <- c("Est of a", "Est of b") 
# labels the columns of the matrix.

require(knitr)

Loading required package: knitr

# summary produces a standard summary of the matrix.
sum_tab <- summary(sim_res)
rownames(sum_tab) <- NULL  # no row names.
# NULL creates an empty object in R.
kable(sum_tab)

Table 1: Summary of 1,000 simulations of OLS estimates of samples with 100 observations. The true values are \(a = 2\) and \(b = 3\).

Est of a	Est of b
Min. :0.7771	Min. :0.5079
1st Qu.:1.7496	1st Qu.:2.5155
Median :2.0295	Median :2.9584
Mean :2.0260	Mean :2.9666
3rd Qu.:2.3113	3rd Qu.:3.4122
Max. :3.5442	Max. :5.9322

Table 1 summarizes the results of the simulation. On average the estimated parameters are quite close to the true values, differing by 0.026 and 0.033 respectively. These results suggest the estimator is unbiased. An estimator is unbiased if it is equal to the true value on average. The OLS estimator is unbiased in this simulation because the unobserved term is drawn from a distribution with a mean of 0.

The table also shows that our estimates could differ substantially from the true value. At worst, the estimated value of the \(b\) parameter could be almost twice its true value. As an exercise, what happens when the sample size is decreased or increased? Try \(N = 10\) or \(N = 5,000\).

Introduction to the Bootstrap

It would be great to present the analysis above for any estimate. Of course we cannot do that because we do not know the true values of the parameters. The Stanford statistician, Brad Efron, argues we can use the analogy principle. We don’t know the true distribution but we do know its sample analog, which is the sample itself.

Efron’s idea is called bootstrapping. And no, I have no idea what a bootstrap is. It refers to the English saying, “to pull one’s self up by his bootstraps.” In the context of statistics, it means to use the statistical sample itself to create an estimate of how accurate our estimate is. Efron’s idea is to repeatedly draw pseudo-samples from the actual sample, randomly and with replacement, and then for each pseudo-sample re-estimate the model.

If our sample is pretty large then it is a pretty good representation of the true distribution. We can create a sample analog version of the thought exercise above. We can recreate an imaginary sample and re-estimate the model on that imaginary sample. If we do this a lot of times we get a distribution of pseudo-estimates. This distribution of pseudo-estimates provides us with information on how uncertain our original estimate is.

Bootstrap in `R`

We can create a bootstrap estimator in R. First, we create a simulated sample data set. The bootstrap code draws repeatedly from the simulated sample to create 1000 pseudo-samples. The code then creates summary statistics of the estimated results from each of the pseudo-samples.

set.seed(123456789) 
K <- 1000 
bs_mat <- matrix(NA,K,2) 
for (k in 1:K) { 
  index_k <- round(runif(N,min=1,max=N)) 
  # creates a pseudo-random sample. 
  # draws N elements uniformly between 1 and N. 
  # rounds all the elements to the nearest integer. 
  data_k <- data1[index_k,] 
  bs_mat[k,] <- lm(y ~ x, data=data_k)$coefficients 
  # print(k) 
  } 
tab_res <- matrix(NA,2,4) 
tab_res[,1] <- colMeans(bs_mat)
# calculates the mean for each column of the matrix. 
# inputs into the first column of the results matrix.
tab_res[,2] <- apply(bs_mat, 2, sd) 
# a method for having the function sd to act on each 
# column of the matrix. Dimension 2 is the columns. 
# sd calculates the standard deviation. 
tab_res[,3] <- quantile(bs_mat[,1],c(0.025,0.975)) 
# calculates quantiles of the column at 2.5% and 97.5%.
tab_res[,4] <- quantile(bs_mat[,2],c(0.025,0.975)) 
colnames(tab_res) <- c("Mean", "SD", "2.5%", "97.5%") 
rownames(tab_res) <- c("Est of a","Est of b") 
# labels the rows of the matrix.

kable(tab_res)

Table 2: Bootstrapped estimates from the simulation.

	Mean	SD	2.5%	97.5%
Est of a	1.806365	0.1992615	1.409638	2.775915
Est of b	3.384804	0.3095560	2.183371	4.018614

Table 2 presents the bootstrapped estimates of the OLS model on the simulated data. The table presents the mean of the estimates from the pseudo-samples, the standard deviation of the estimates and the range which includes 95% of the cases. It is good to see that the true values do lie in the 95% range.

Standard Errors

kable(summary(lm1)[4])

Table 3: OLS results from the simulation.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	1.80339	0.1974404	9.133847	0
x	3.40160	0.3343327	10.174299	0

Table 3 presents the standard results that come out of the lm() procedure in R. Note that these are not the same as the bootstrap estimates.¹⁷ The second column provides information on how uncertain the estimates are. It gives the standard deviation of the imaginary estimates assuming that the imaginary estimates are normally distributed. The last two columns present information which may be useful for hypothesis testing. This is discussed in detail in Appendix A.

The lm() procedure assumes the difference between the true value and the estimated value is distributed normally. That turns out to be the case in this simulation. In the real world there may be cases where it may be less reasonable to make this assumption.

Returns to Schooling

Now we have the basics of OLS, we can start on a real problem. One of the most studied areas of microeconometrics is returns to schooling (Card 2001). Do policies that encourage people to get more education, improve the economic outcomes? One way to answer this question is to use survey data to determine how much education a person received and how much income they earned.

Berkeley labor economist David Card, analyzed this question using survey data on men aged between 14 and 24 in 1966 (Card 1995). The data set provides information on each man’s years of schooling and income. These are measured ten years later, in 1976.

A Linear Model of Returns to Schooling

Card posits that income in 1976 is determined by the individual’s years of schooling.

\[ \mathrm{Income}_i = \alpha + \beta \mathrm{Education}_i + \mathrm{Unobserved}_i \tag{21}\]

In (Equation 21), income in 1976 for individual \(i\) is determined by their education level and other unobserved characteristics such as the unemployment rate in the place they live. We want to estimate \(\beta\).

NLSM Data

The National Longitudinal Survey of Older and Younger Men (NLSM) is a survey data set used by David Card. Card uses the young cohort who were in the late teens and early twenties in 1966. Card’s version can be downloaded from his website, http://davidcard.berkeley.edu/data_sets.html.¹⁸ The original data is a .dat file. The easiest way to wrangle the data is to import it into excel as a “fixed width” file and then copy the variable names from the codebook file. You can then add the variable names to the first line of the excel file and save it as a .csv file. In the code below I named the file nls.csv.¹⁹

x <- read.csv("nls.csv", as.is=TRUE) 
# I follow the convention of defining any data set as x 
# (or possibly y or z). 
# I set the working directory to the one where my main 
# script is saved, which is the same place as the data.
# Sessions -> Set Working Directory -> Source File Location 
# It is important to add "as.is = TRUE", 
# otherwise R may change your variables into "factors" 
# which is confusing and leads to errors. 
# factors can be useful when you want to instantly create 
# a large number of dummy variables from a single variable.
# However, it can make things confusing if you don't realize 
# what R is doing. 
x$wage76 <- as.numeric(x$wage76)

Warning: NAs introduced by coercion

x$lwage76 <- as.numeric(x$lwage76)

Warning: NAs introduced by coercion

# changes format from string to number. 
# "el wage 76" where "el" is for "log" 
# Logging helps make OLS work better. This is because wages 
# have a skewed distribution, and log of wages do not. 
x1 <- x[is.na(x$lwage76)==0,] 
# creates a new data set \# with missing values removed 
# is.na() determines missing value elements ("NA"s)

After reading in the data, the code changes the format of the data. Because some observations have missing values, the variables will import as strings. Changing the string to a number creates a missing value code NA. The code then drops the missing variables.

Plotting Returns to Schooling

lm1 <- lm(lwage76 ~ ed76, data=x1)
plot(x1$ed76,x1$lwage76, xlab="Years of Education", 
     ylab="Log Wages (1976)")
# plot allows us to label the charts
abline(a=lm1$coefficients[1],b=lm1$coefficients[2],lwd=3)

Figure 3 presents a simple plot of the relationship between log wages in 1976 and the years of schooling. Again we see a distinct positive relationship even though the observations are spread out in a cloud. We can see that people who do not graduate from high school (finish with less than 12 years of education) earn less on average than those who attend college (have more than 12 years of education). There is a lot of overlap between the distributions.

Estimating Returns to Schooling

kable(summary(lm1)[4])

Table 4: OLS regression of log wages from 1976 on the number of years of schooling up to 1976.

	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	5.5708823	0.0388295	143.47049	0
ed76	0.0520942	0.0028697	18.15315	0

We can use OLS to estimate the average effect of schooling on income. Table 4 gives the OLS estimate. The coefficient estimate of the relationship between years of schooling and log wages is 0.052. The coefficient is traditionally interpreted as the percentage increase in wages associated with a 1 year increase in schooling (Card 1995). This isn’t quite right, but it is pretty close. Below shows the predicted percentage change in wages, measured at the mean of wages, being 5.4%. Compare this to reading the coefficient as 5.2%.

exp(log(mean(x1$wage76)) + lm1$coefficients[2])/mean(x1$wage76)

    ed76 
1.053475

This estimate suggests high returns to schooling. However, the model makes a number of important assumptions about how the data is generated. The following chapters discuss the implications of those assumptions in detail.

Discussion and Further Reading

OLS is the “go to” method for estimation in microeconometrics. The method makes two strong assumptions. First, the unobserved characteristics must be independent of the policy variable. Second, the unobserved characteristics must affect the outcome variable additively. These two assumptions allow averaging to disentangle the causal effect from the effect of unobserved characteristics. We can implement the averaging using one of two algorithms, a matrix algebra-based algorithm or the least squares algorithm. In R, the two algorithms are computationally equivalent. The next two chapters consider weakening the assumptions. Chapter 2 takes the first step by allowing the OLS model to include more variables. Chapter 3 considers cases where neither assumption holds.

Measuring and reporting uncertainty has been the traditional focus of statistics and econometrics. This book takes a more “modern” approach to focus attention on issues around measuring and determining causal effects. This chapter takes a detour into the traditional areas. It introduces the bootstrap method and discusses traditional standard errors. These topics are discussed in more detail in the Appendix.

The chapter introduces causal graphs. Pearl and Mackenzie (2018) provide an excellent introduction to the power of this modeling approach.

To understand more about the returns to schooling literature, I recommend Card (2001). Chapters 2 and 3 replicate much of the analysis presented in Card (1995)}. The book returns to the question of measuring returns to schooling in Chapters 2, 3, 6, 8 and 11.

References

Card, David. 1995. “Aspects of Labour Market Behavior: Essays in Honour of John Vanderkamp.” In, edited by Louis N. Christofides, E. Kenneth Grant, and Robert Swidinsky, 201–22. University of Toronto Press.

———. 2001. “Estimating the Returns to Schooling: Progress on Some Persistent Econometric Problems.” Econometrica 69 (5): 1127–60.

Imbens, Guido W. 2010. “Better LATE than nothing: some comments on Deaton (2009) and Heckman and Urzua (2009).” Journal of Economic Literature 48 (2): 399–423.

Manski, Charlies F. 1988. Analog Estimation Methods in Econometrics. Chapman; Hall.

Pearl, Judea, and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.

Footnotes

OLS can estimate models where the unobserved characteristic enters multiplicatively as this model is additive in logs.↩︎
Under additional assumptions we can use other algorithms such as maximum likelihood or the generalized method of moments.↩︎
For a matrix algebra refresher check out Khan Academy and Sol’s excellent presentation of the basic concepts. Appendix B presents additional discussion of the use of matrices in R.↩︎
This is useful to remember if you are ever hungry in Athens because it is the first letter of the famous Greek sandwich, the souvlaki (\(\Sigma o u \beta \lambda \alpha \kappa i\)).↩︎
See discussion in Appendix A.↩︎
An estimate is said to be biased if we take a large number of imaginary samples and the average imaginary estimate differs from the true value. This issue is discussed in detail in Appendix A.↩︎
No. In matrix algebra, order matters!↩︎
They are not generally equal because we didn’t include the unobserved term (\(u\)).↩︎
The rank of the matrix refers to the number of linearly independent columns (or rows). A matrix is full-rank if all of its columns (rows) are linearly independent of each other.↩︎
A transpose of a matrix is the same matrix with the columns and rows swapped.↩︎
Remember that for matrix multiplication the “inside” numbers need to match.↩︎
We traditionally use the ``hat’’ notation to represent that \(\hat{\beta}\) is the estimate of \(\beta\).↩︎
In the simulation the unobserved term has a mean of 0. As mentioned above, the average of the unobserved term gets closer to zero when \(N\) is larger.↩︎
Under certain conditions the optimal value, minimum or maximum, can be determined by the first order condition. The first order condition can be found by taking the derivative and setting the equation to zero.↩︎
This is arbitrary, but I know the true values lie between these two values.↩︎
See the discussion above for the previous estimators.↩︎
The derivation of these numbers is discussed in Appendix A.↩︎
This and other similar surveys can also be found at the US Bureau of Labor Statistics (https://www.bls.gov/nls/home.htm).↩︎
This data is found here: https://sites.google.com/view/microeconometricswithr/table-of-contents Note before running this code you need to create a script, say Card.R and save that file with your data. You can then go to the RStudio menu and set the working directory to the location of your script (Session -> Set Working Directory -> To Source File Location).↩︎