7 Regression with multiple regressors

Abstract

This chapter covers multiple regression analysis, which is a method for estimating the parameters that describe a linear relationship between several regressor variables and an outcome.

Keywords: Hypothesis test, confidence interval, bias, least-squares

7.1 Multiple regression analysis

Previous chapters have examined the single variable regression model. A variable \(X\) is hypothesized to affect an outcome \(Y\), and the relationship is assumed to be linear. The explanatory variable \(X\) is assumed to be independent of, or uncorrelated with, any other factors that influence the outcome, including any systematic measurement error in the outcome. This assumption is expressed mathematically as \(E(ϵ_i|X) = 0\): the expected value of the error term, conditional on the level of the explanatory variable, is equal to zero. The assumption would be violated if other factors influencing the outcome were correlated with the level of the explanatory variable. In the earnings and education example, with earnings as the outcome variable, the assumption would be violated if individuals with many years of work experience were also likely to be people with less education. That is, individuals who started working after completing primary school would have perhaps six more years of work experience than people who had completed high school. Earnings, one might imagine, are influenced by both work experience and education.

If earnings are due to two factors, then implicitly we have a new, different model. If the relationship is linear, we might write: \[ earnings\_usd = β_0 + β_1 educ\_yrs + β_2 exper + ϵ\]

where \(exper\) is a variable measuring the years of work experience of an individual, and \(educ\_yrs\) is, as in previous chapters, the years of education attainment of an individual. We could also have the education variable be the log of education, and interpret the estimated coefficient accordingly.

Expressed more generally, the multiple regression model is: \[Y = β_0 + β_1 X_1 + β_2X_2 + \cdots + β_k X_k + ϵ\] where the \(β_j\) are the parameters describing how a change in the associated explanatory variable \(X_j\) affects the outcome. This model is sometimes called the multiple regression model, or the multivariate regression model, or the multivariable regression model. The model does not have to be limited to two or three explanatory variables, and could in principle have tens of thousands (or just \(k\), generally) of explanatory variables.

We also can write the model by indexing for each entity in the population, noting that the model with the error terms “holds” for each entity (for that reason this is sometimes called the population regression model), \[Y_i = β_0 + β_1X_{1i} + β_2X_{2i} + β_3X_{3i} + \cdots + β_kX_{ki} + ϵ_i, \quad i = 1, \dots, N. \] As before, a key assumption for the validity or credibility of the interpretation of the coefficients estimated using a sample of data from the relevant population is that the other factors influencing the outcome (omitted from the \(X\)’s in the equation and thus represented by the \(ϵ\)) are uncorrelated with the explanatory variables. Mathematically,

\[E(ϵ_i|X_1,X_2, . . . ,X_k) = 0 \] The expected value of the error term, conditional on the levels of the explanatory variables, is equal to zero. These other variables that influence the outcome are sometimes called “confounders” if they are correlated with the included \(X\). So the assumption is saying, colloquially, that there are no confounders. Some social scientists prefer the term “omitted variables” to indicate that their absence means there is no “omitted variable bias.” In this formulation, all of the confounders are included in the \(X\) so there are no omitted confounders. Usage varies and evolves!

We estimate \(β_0, β_1, . . . , β_k\) using the same method of ordinary least squares, choosing values \(b_0, b_1, b_2, . . . , b_k\) for the coefficients that minimize the sum of squared residuals (SSR):

\[\sum_{i=1}^n(Y_i- b_0 − b_1X_{1i} − b_2X_{2i} − \cdots − b_kX_{ki})^2\] As in the single variable case, we can use calculus to solve the problem and obtain formulas for the estimated coefficients \(\hat{\beta_0}, \hat{\beta_1}, . . . , \hat{\beta_k}\). These formulas are too cumbersome to write out unless we use a compact notation system called matrix algebra, so that is left for the reader to explore in the future. Suffice it to say that the basic intuition is straightforward: the sign and magnitude of the estimated coefficient for any particular explanatory variable will depend on the correlation between that variable and the outcome, and also on the correlations between that explanatory variable and the other explanatory variables. Imagine, for example, that a variable has a true sizable positive effect on some outcome, but that the variable is also strongly correlated with another variable that negatively affects the outcome. The ordinary least squares method of estimating the coefficient will take into account the cross-correlations of the variable with the other explanatory variables.
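For readers who are curious, here is a small sketch, using simulated data with invented variable names, of how the matrix algebra formula for the OLS coefficients, \(b = (X'X)^{-1}X'Y\), reproduces the coefficients that R's lm() function reports.

# A sketch with simulated data: OLS coefficients via matrix algebra
set.seed(42)
n  <- 500
x1 <- rnorm(n)                          # invented regressor
x2 <- 0.5 * x1 + rnorm(n)               # second regressor, correlated with x1
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                  # design matrix with a column of ones
b  <- solve(t(X) %*% X) %*% t(X) %*% y  # the matrix formula (X'X)^{-1} X'y
b
coef(lm(y ~ x1 + x2))                   # lm() gives the same numbers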

7.2 Sampling distribution

The estimated coefficients can be thought of as random variables, and will thus have a (joint) sampling distribution. This sampling distribution is somewhat complex, and requires matrix algebra to represent. If the usual assumptions in the multiple regression model hold (the same assumptions as in the single-variable case, with one extra assumption of no perfect multicollinearity, discussed below), then, in large samples, the OLS estimators \(\hat{\beta_0}, \hat{\beta_1}, . . . , \hat{\beta_k}\) are jointly normally distributed. We also say that their joint distribution is multivariate normal. Each \(\hat{\beta}_j\) is approximately distributed \(N(β_j , σ^2_{\hat{\beta}_j})\). In particular, we can show, using basic matrix algebra, that the expected value of each coefficient estimate is equal to the respective true value or population parameter. (It is a matter of philosophical discussion in statistics what one means, precisely, by “true values,” but that discussion can be left for another day.)
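A small simulation, with invented variables, can make the idea of a sampling distribution concrete: we draw many samples from the same population model, estimate the regression each time, and look at the collection of estimated slopes.

# A sketch: the sampling distribution of an OLS slope, via simulation
set.seed(7)
reps <- 2000
b1   <- numeric(reps)
for (r in 1:reps) {
  x1 <- rnorm(200)
  x2 <- 0.5 * x1 + rnorm(200)
  y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(200)   # true slope on x1 is 1.5
  b1[r] <- coef(lm(y ~ x1 + x2))["x1"]
}
mean(b1)   # should be close to the true value 1.5 (no bias)
hist(b1)   # roughly bell-shaped: approximately normal in large samples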

The variance of the estimated coefficients (or the square root of the variance, called the standard error) can also be determined using matrix algebra. These standard errors are computed from the sample data, either under the assumption of homoskedasticity or, alternatively, in a way that is robust to heteroskedasticity. In the multiple regression case, homoskedastic errors mean that,

\[var(ϵ_i|X_{1i},X_{2i}, \dots ,X_{ki}) = \text{constant, for } i=1, \dots , n \] As in the single variable regression case, the standard errors can be used to calculate t-statistics for each coefficient. In large samples, these t-stats follow the standard normal distribution.

\[ t^{act} = \frac{\hat{\beta}_j - \beta_{j, 0}}{SE(\hat{\beta}_j)}\] One may then calculate p-values for testing the usual null hypothesis that the coefficient is equal to zero (no effect of the explanatory variable on the outcome) or other hypotheses.
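As a quick check of the mechanics, the t-statistic and p-value that R reports can be reproduced by hand from the estimated coefficient and its standard error. A sketch, again with simulated data and invented names:

# A sketch with simulated data: a t-stat and p-value computed by hand
set.seed(123)
x1  <- rnorm(1000)
x2  <- rnorm(1000)
y   <- 1 + 0.3 * x1 + rnorm(1000)      # x2 has no true effect
fit <- lm(y ~ x1 + x2)
est <- coef(summary(fit))["x1", "Estimate"]
se  <- coef(summary(fit))["x1", "Std. Error"]
t_act <- (est - 0) / se                # testing H0: beta_1 = 0
p_val <- 2 * pnorm(-abs(t_act))        # two-sided p-value, normal approximation
c(t_act, p_val)
coef(summary(fit))["x1", ]             # compare with lm()'s own t value and p-value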

7.3 Interpreting the coefficients

The regression coefficients for most variables can be interpreted in English as follows: Controlling for the other included variables, and assuming that other factors are not correlated with the error term, the coefficient indicates how much a one unit change in the explanatory variable affects the outcome. The intercept coefficient, \(β_0\) (also sometimes called the constant), is rarely of interest and usually has little economic interpretation.

In the case of a coefficient on a dummy variable, the interpretation is: Controlling for the other included variables, and assuming that other factors are not correlated with the error term, the coefficient indicates the expected difference in the outcome between the category 1 and the category 0.

In the case of coefficients estimated for logged variables, or if the outcome is a logged variable, or if the outcome is a dummy variable, the interpretation is straightforwardly similar to the interpretation from the relevant single variable regression case, with the addition of the “controlling for” language.

As in the single variable regression, another way to think about interpreting the estimated regression coefficients in the multiple regression case is by using them to “predict” values of \(Y\) that one might expect, based on the regression results, for given values of the \(X_k\). Suppose we estimated a regression line that was \(Y = b_0 + b_1X_1 + b_2X_2 = 20 + 2X_1 + 7X_2\). If \(X_1\) were equal to 20 and \(X_2\) were equal to 10, we would predict that \(Y\) would be equal to 130. If \(X_1\) were equal to 40 and \(X_2\) were equal to 0, we would predict that \(Y\) would be equal to 100. We can use the regression estimates to predict likely or expected values of \(Y\) for different levels of \(X_k\).

To recap, when we are interpreting the estimated regression coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\) in a regression \(Y = 20.7 − 3.2X_1 + 7.6X_2\) we would say, or write:

“The regression results suggest that for a one unit change in \(X_1\) the outcome will decrease by 3.2, controlling for the level of \(X_2\), while a one unit change in \(X_2\) is associated with an increase in the outcome of 7.6 units, controlling for the level of \(X_1\).” Please again note the use of the word “suggest” here, rather than more declarative words such as “show,” “demonstrate,” or “prove.” The stock phrase “controlling for other factors” usually just has to be written once, in a discussion of regression results, and is implied for interpretation of other coefficients. Sometimes the phrases “partial effect” or “holding constant the other variables” are used in this context.

7.4 Standardized coefficients

Very often one wants to compare the coefficients estimated for different variables in the multiple regression framework. If the outcome is earnings, and education and experience are the two explanatory variables, it is natural to ask “Which variable matters more?” The two variables are measured in different units, however, and the coefficients mechanically reflect the units (as we saw in the single variable regression example). Divide the \(X\) variable by 100 and the estimated coefficient is multiplied by 100; multiply the \(X\) variable by 100 and the estimated coefficient is divided by 100. In order to compare coefficients, social scientists very often compute “standardized coefficients.” We rescale the coefficients by measuring the variables in units of their sample standard deviations rather than their original units. Before running the regression, divide each \(X\) variable by its standard deviation and divide the outcome variable by its standard deviation. The estimated coefficients then represent the effect of increasing the \(X\) variable by one standard deviation, in terms of standard deviation changes in \(Y\). That is, the standardized slope coefficient tells us the expected marginal effect of a one-standard-deviation change in \(X\) on \(Y\), measured in standard deviations of \(Y\). So one might say, a 1 standard deviation change in \(X\) is associated with a .25 standard deviation change in \(Y\).

Confusingly, standardized coefficients are sometimes called “beta coefficients” – not to be confused with the coefficient symbol \(β\), which we call beta.

Note also that standardized coefficients do not make much sense in the case of log specifications or for dummy variables.

In more everyday language, standardized coefficients tell us how much a “typical” change in the \(X\) variable will move \(Y\), other things equal. We allow the data to tell us what is a typical amount of variation. The standardized coefficient is just one way of judging magnitude of effect, and is more appropriate for continuous variables than for binary variables.

Note that we can also calculate standardized coefficients simply by taking each coefficient and multiplying it by \(sd(X_k)\) and dividing by \(sd(Y )\), where \(sd(X_k)\) is the standard deviation for the \(X_k\) variable. So if a coefficient of \(X_k\) were 5.2, and the standard deviation of the \(X_k\) variable were 3.0, while the standard deviation of the \(Y\) variable were 2.0, then the standardized coefficient would be equal to \(5.2 ∗ 3.0/2.0 = 7.8\). So a one standard deviation change in \(X_k\) is associated with a 7.8 standard deviation change in \(Y\) . This is a big effect, one might think.
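Both routes to a standardized coefficient, rescaling the variables before the regression or rescaling the estimated coefficient afterward, give the same answer. A sketch with simulated data and invented names:

# A sketch with simulated data: two ways to get a standardized coefficient
set.seed(1)
x <- rnorm(300, mean = 10, sd = 3)
y <- 5 + 2 * x + rnorm(300, sd = 4)
b <- coef(lm(y ~ x))["x"]
b * sd(x) / sd(y)                      # rescale the estimated coefficient
x_std <- (x - mean(x)) / sd(x)         # or standardize the variables first...
y_std <- (y - mean(y)) / sd(y)
coef(lm(y_std ~ x_std))["x_std"]       # ...and re-run the regression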

Of course, interpreting and comparing the magnitudes of coefficients on different variables should always depend on a contextual understanding of the regression. The effect of a change in a tax rate on consumption of a good may be small, but if the good is a “noxious” good that leads to significant health problems, even a small effect might be thought of as supporting evidence for policy reform. If, on the other hand, someone estimated that a large tax rate increase only led to a small change in government revenue, and raising revenue was the goal, then the small magnitude would seem to suggest that policy debates would be better focused on other sources of revenue. Thinking of the economic or social importance of changes in \(X\), and the economic or social importance of changes in \(Y\) , is the right way to interpret coefficients.

7.5 Goodness of fit

Overall goodness of fit measures can be calculated from the multiple regression results. The formula for \(R^2\) is again the explained sum of squares divided by the total sum of squares. This statistic, as an indicator of goodness of fit, has a problem: if one adds variables to the regression, the \(R^2\) automatically rises (or at least never falls), because a new variable either contributes nothing or something to explaining the outcome. The OLS method chooses estimates to minimize the sum of squared residuals: adding a new explanatory variable can never increase the sum of squared residuals, because the algorithm always has the “choice” of giving the new variable a zero coefficient. Consider a thought experiment. The \(R^2\) without the additional explanatory variable is .37. Suppose the estimated OLS regression with the additional variable included yielded an \(R^2\) of .33, so that the \(R^2\) went down. If the algorithm had instead set the coefficient of the new variable to exactly zero, the \(R^2\) would have been .37, the same as in the first regression. In other words, the estimated coefficients that generated the \(R^2\) of .33 could not have been the ones that minimized the sum of squared residuals. The contradiction illuminates why \(R^2\) never goes down when new variables are added. Statisticians have thus developed a modified measure of goodness of fit that penalizes the researcher for adding new variables to the model. The adjusted \(R^2\), sometimes written \(\overline{R^2}\), is \[ \overline{R^2} = 1 - \frac{n-1}{n-k-1} \frac{SSR}{TSS}\] where \(n\) is the sample size and \(k\) is the number of regressors (the number of explanatory variables).

It bears repeating that the goal of regression analysis is not to “have a high \(\overline{R^2}\).” First of all, many outcomes in the social sciences are by their nature highly idiosyncratic; human behavior has a lot of randomness because it is intentional rather than mechanical. Second, different combinations of variables will yield different levels of \(\overline{R^2}\). These different combinations are often called specifications. In choosing which is the “best” specification, social scientists are guided by an implicit or explicit theory of which variables matter in explaining the outcome. A specification that includes a variable that has nothing to do with the theory of the relationship may not be preferable, despite having a higher \(\overline{R^2}\).

As before, the standard error of the regression is given by, \[SER= \sqrt{\frac{1}{n-k-1} \sum_{i=1}^{n}\hat{\epsilon}_i^2}\] where again \(k\) is the number of regressors (the number of explanatory variables).
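Both the adjusted \(R^2\) and the SER can be computed by hand from a regression's residuals and compared with what R reports. A sketch with simulated data and invented names:

# A sketch with simulated data: adjusted R-squared and SER by hand
set.seed(2)
n  <- 200
k  <- 2                                  # number of regressors
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
ssr <- sum(resid(fit)^2)                 # sum of squared residuals
tss <- sum((y - mean(y))^2)              # total sum of squares
1 - (n - 1) / (n - k - 1) * ssr / tss    # adjusted R-squared, by hand
summary(fit)$adj.r.squared               # same number from summary()
sqrt(ssr / (n - k - 1))                  # SER, by hand
summary(fit)$sigma                       # R calls this the residual standard error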

7.6 OLS assumptions

We noted above the need for one more assumption in the multiple regression model. The assumptions from the single variable case are generalized to the multiple regression case as:

• The \(X\) variable of interest causes the \(Y\) variable that is the outcome
• The relationship is a linear relationship (with variables suitably transformed)
• \(E(ϵ_i|X_1,X_2, . . . ,X_k) = 0\)
• \((Y,X_1,X_2, . . . ,X_k)\) are independent and identically distributed, i.i.d.
• The probability of large outliers is low

The new assumption is that there is no perfect multicollinearity among the \(X\) variables. That is, none of the \(X\) variables can be a linear combination of the other \(X\) variables. Consider a simple, absurd example. Suppose an econometrician estimated the relationship between productivity at construction work and outside air temperature. The hypothesis is that on hotter days construction workers are less productive. The single variable regression finds exactly that, estimating a negative coefficient. Now suppose the researcher decides to include, in addition to the temperature measured in Fahrenheit, the temperature measured in centigrade. This second variable is just a linear transformation of the first variable.

We make two observations. First, if we knew matrix algebra we could easily show that including the two variables would mean that in our calculations to minimize the sum of squared residuals at a certain point we would be dividing by zero. Since this cannot be done in mathematics, our OLS estimator will not be defined. Statistical computing software will usually “drop” one of the variables if it finds there is perfect multicollinearity. Second, consider our interpretation of the regression coefficients as the effect on the outcome of changing the variable by one unit while holding the other variables constant. But if one variable is just the other variable measured in a different unit, it makes no logical sense to suppose it is being held constant. If temperature in Fahrenheit goes up, temperature in centigrade automatically goes up. For these two reasons, then, the \(X\) variables cannot exhibit perfect multicollinearity.
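We can see this in R with a made-up version of the temperature example: when one regressor is an exact linear transformation of another, lm() reports an NA coefficient for the redundant variable, signaling that it has been dropped.

# A sketch with simulated data: perfect multicollinearity
set.seed(3)
temp_f <- runif(100, 50, 100)                      # temperature in Fahrenheit
temp_c <- (temp_f - 32) * 5 / 9                    # exact linear transformation
productivity <- 80 - 0.4 * temp_f + rnorm(100, sd = 5)
coef(lm(productivity ~ temp_f + temp_c))
# temp_c comes back NA: R detects the perfect collinearity and drops the variable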

The problem of perfect multicollinearity arises more often than one might think because of the “dummy variable trap.” The “trap” happens when a categorical variable is converted into a set of dummy variables. Suppose, for example, that one had data on individuals and a variable indicated which age category the individual belonged to, 30-40, 40-50 or 50-60. An analyst might then create three dummy variables, each taking on values 0 or 1 according to the category.

\[age3040_i = \left\{ \begin{aligned} 1 \text{ if age between 30 and 40} \\ 0 \text{ otherwise} \end{aligned} \right.\]

\[age4050_i = \left\{ \begin{aligned} 1 \text{ if age between 40 and 50} \\ 0 \text{ otherwise} \end{aligned} \right.\]

\[age5060_i = \left\{ \begin{aligned} 1 \text{ if age between 50 and 60} \\ 0 \text{ otherwise} \end{aligned} \right.\].

The categories are obviously mutually exclusive. For each row in the dataset, representing the observation for a person, it would be the case that, \(age3040_i+age4050_i+age5060_i = 1\). That is, one of the dummy variables is a linear combination of the other two. This is perfect multicollinearity. One of the dummy variables must be dropped. The interpretation of the coefficients on the included dummy variables then is that they indicate the expected difference between that category and the excluded or dropped category.
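A sketch, with invented data, of the dummy variable trap in R: including all three age-category dummies along with the intercept produces an NA coefficient, while dropping one of them gives coefficients interpreted relative to the omitted category. (In practice, using factor() for a categorical variable lets R handle the omitted category automatically.)

# A sketch with simulated data: the dummy variable trap
set.seed(4)
age  <- sample(30:59, 500, replace = TRUE)
earn <- 50 + 0.5 * age + rnorm(500, sd = 10)
age3040 <- as.numeric(age >= 30 & age < 40)
age4050 <- as.numeric(age >= 40 & age < 50)
age5060 <- as.numeric(age >= 50 & age < 60)
coef(lm(earn ~ age3040 + age4050 + age5060))   # one dummy is dropped (NA)
coef(lm(earn ~ age4050 + age5060))             # age 30-40 is the omitted category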

7.7 Multiple regression in R

In R, multiple regression is straightforward. Running and presenting the regressions is the easy part; the more challenging parts are interpreting the regression coefficients, understanding why the estimated coefficients might be biased (which we cover in the next chapter), and appreciating the art of choosing which variables to include in the regression (which is also covered more extensively in subsequent chapters).

Consider the following script, where we access the Kenya DHS 2022 dataset (and recall, you will need the usual setup and library() commands). We create the log of monthly earnings as the outcome variable.

# Read Kenya DHS 2022 data from a website
url <- "https://github.com/mkevane/econ42/raw/main/kenya_earnings.csv"
kenya <- read.csv(url)
# Create new variables: potential experience and log earnings
kenya$exper <- kenya$age - 6 - kenya$educ_yrs
kenya$log_earn <- log(kenya$earnings_usd)
# Run several regressions
reg1 <- lm(log_earn ~ educ_yrs,
           data = subset(kenya, earnings_usd <= 1000))
reg2 <- lm(log_earn ~ educ_yrs + exper,
           data = subset(kenya, earnings_usd <= 1000))
reg3 <- lm(log_earn ~ educ_yrs + exper + female,
           data = subset(kenya, earnings_usd <= 1000))
reg4 <- lm(log_earn ~ educ_yrs + exper + female + muslim,
           data = subset(kenya, earnings_usd <= 1000))
# Put regression results in a list
models <- list(reg1, reg2, reg3, reg4)
# Make the table with all regressions as separate columns
modelsummary(models,
             title = "Table: Log earnings vary with education, experience, and gender in Kenya",
             stars = TRUE,
             gof_map = c("nobs", "r.squared", "adj.r.squared"),
             fmt = 2)
Table 7.1: Log earnings vary with education, experience, and gender in Kenya

|             | (1)     | (2)     | (3)      | (4)      |
|-------------|---------|---------|----------|----------|
| (Intercept) | 3.89*** | 3.31*** | 3.63***  | 3.48***  |
|             | (0.02)  | (0.04)  | (0.04)   | (0.04)   |
| educ_yrs    | 0.10*** | 0.12*** | 0.13***  | 0.13***  |
|             | (0.00)  | (0.00)  | (0.00)   | (0.00)   |
| exper       |         | 0.02*** | 0.02***  | 0.02***  |
|             |         | (0.00)  | (0.00)   | (0.00)   |
| female      |         |         | −0.59*** | −0.55*** |
|             |         |         | (0.02)   | (0.02)   |
| muslim      |         |         |          | 0.56***  |
|             |         |         |          | (0.03)   |
| Num.Obs.    | 20829   | 20829   | 20829    | 20829    |
| R2          | 0.092   | 0.110   | 0.162    | 0.176    |
| R2 Adj.     | 0.092   | 0.110   | 0.162    | 0.176    |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

These commands load the Kenya DHS 2022 data set, run several regressions with log earnings as the dependent or outcome variable and various regressors as the independent or explanatory variables, and display the results. The modelsummary command is used to present the results of each regression estimation in a separate column. Before the modelsummary command, a list() command puts the results of each regression in a list, and then modelsummary takes that list to display the regression results. Note that the regressions in columns (2)-(4) are multiple regressions, which have more than one regressor (more than one \(X\) variable). In R we can add regressors simply by including them with a “+” sign, as in log_earn ~ educ_yrs + exper.

Each separate regression is called a specification. Econometrics in practice is about applying knowledge about the relationship– the theory of the relationship– to the data available to determine the most appropriate specification. There are many possible specifications, and sometimes a researcher is tempted to “hunt” for a specification that better fits the data, or yields coefficient estimates that confirm a theory the researcher wants to confirm. Such specification hunting is somewhat unethical, and should be avoided as a matter of good research practice.

Examine the modelsummary command, which creates the table that appears in the Viewer pane or window. Note that we include models, a list of the names of all the regressions (reg1, reg2, ...). The option stars=TRUE puts asterisks next to coefficients according to the p-value of the hypothesis test for the coefficient. The option gof_map is used to have the display include the number of observations in the sample that are used in the regression (the number of non-missing observations) and the goodness of fit indicators R-squared and adjusted R-squared. The option fmt=2 keeps numbers to two decimal places. Try running the command again with option fmt=3.

It is very important that you pay attention to the parentheses in a command. The parentheses in every R command must balance, in the sense that for every open parenthesis “(” there is a later close parenthesis “)”. When you highlight and run all the lines from the beginning of the script through the entire modelsummary command, you will see the table of regression results appear in the viewer.

We turn now to discussing how to interpret the coefficients for each of the regressions presented in the table, and how to interpret hypothesis tests and confidence intervals for each variable.

The regression results are presented in Table 7.1. We have already seen the interpretation for the single variable log-linear regression reported in column (1). Turning to the multiple variable regression in column (2), adding the experience variable changes the coefficient on years of education somewhat. The coefficient on an additional year of education is .12, meaning that an additional year raises log earnings by .12, which we interpret as raising monthly earnings by approximately 12%. Each additional year of experience raises monthly earnings by approximately 2%. In the multiple variable regressions of columns (3) and (4) we add two dummy variables, first a dummy variable for whether the person is female, and then a dummy variable for whether the person is Muslim. Being female (compared with being male) lowers earnings by about 59%. Being Muslim implies that earnings are higher by about 56%. We take our interpretation one step further: a Muslim woman earns, on average, about the same as a non-Muslim man, controlling for years of education and experience.

All of the coefficients in all of the specifications are statistically significant at the .1% level. The modelsummary command helpfully displays asterisks in the table next to the estimated coefficients, informing the reader whether the p-value is less than .10, .05, .01, or .001. Three asterisks indicate a p-value less than .001. The table shows us that the p-values for each and every coefficient are below .001. Consider the coefficient on the variable female in specification (3). The coefficient is -.59, while the standard error of the estimated coefficient is .02. The t-stat is the coefficient divided by the standard error, so about 30 in absolute value. Our usual threshold for rejecting the null hypothesis of a zero coefficient is a t-stat value of 1.96, in absolute value. Plainly, the probability of obtaining a t-stat of 30 or larger in absolute value, with a sample of more than 20,000, if the true coefficient were zero, is minuscule. You can enter the command 2*pnorm(-abs(-.59/.02)) in R to see just how small. The \(R^2\) and adjusted \(R^2\) are the same (to the third decimal place) because of the large sample size of more than 20,000 observations. The \(R^2\) is about .176 for the specification in column (4); about 18% of the variation in log earnings is explained by the four variables included in the regression.

One note of caution! We do not know whether we can properly interpret the effect of education on earnings as a causal effect, so we should be careful about using language of causality here. We will explore this issue in subsequent chapters.

7.8 Prediction using regression results

We can use a regression to predict the \(Y\) value for specific hypothetical or actual values of the regressors using the predict function. Let us do this using the results of regression reg4 and predict the log of earnings for a Muslim woman in Kenya who has 6 years of education and 20 years of experience (that is, she is 32 years old). To do this, we first create a new dataframe with these values of the \(X\) variables:

newdata <- data.frame(educ_yrs = 6, exper = 20,
                      female = 1, muslim = 1)

Next, we use the predict command, telling R which regression coefficients to use (reg4) and which dataframe (newdata):

predict(reg4, newdata)
##    1 
## 4.68

Run these commands and see that the result is predicted log earnings of 4.68. The predict command will also calculate the upper and lower limits of a 95% confidence interval around the prediction. There are two kinds of confidence intervals for predictions. The first is obtained with the option interval="confidence" in the predict command. This gives us the 95% confidence interval for \(E(Y |X)\): that is, what is the expected log earnings over a large number of individuals with this set of characteristics \(X\)? Because our coefficient estimates are uncertain, there is some uncertainty about this expected log earnings, and the CI gives us a plausible range of values.

The second kind of interval is often called a 95% prediction interval, and it is obtained with the option interval="prediction". The prediction interval can be interpreted as a range of plausible values for the log earnings of a single individual with these characteristics. Because of the residual variation around predicted values, the 95% prediction interval is always wider than the confidence interval of the prediction.

Run these two intervals and see what you get. Is the prediction interval in fact wider?
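One way to obtain both intervals, using the reg4 and newdata objects created above:

# 95% confidence interval for E(Y|X)
predict(reg4, newdata, interval = "confidence")
# 95% prediction interval for a single individual (wider)
predict(reg4, newdata, interval = "prediction")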

The prediction is for log earnings, but presumably we want to have the prediction in dollars (as plain earnings). As an approximation, we can apply the exponential function to the predicted log earnings: \[ \text{predicted earnings} = e^{\text{predicted log earnings}}\] We can see in R that this will give us predicted earnings of about 108 dollars. The following is the code for that:

exp(predict(reg4, newdata))
##   1 
## 108

7.9 Recapitulation, in the style of a character in the novel No Country for Old Men

For fun, we asked ChatGPT to summarize a key idea in this chapter, in the style of one of Cormac McCarthy’s most memorable characters, Anton Chigurh. Here is the generated text:

You got yourself a hypothesis test for a regression coefficient? Well, let me tell you, it ain’t no country club dance. It’s like standin’ at a crossroads, starin’ down fate with a coin in your hand, flippin’ it to decide your future. Here’s how it goes. You got this coefficient, sittin’ there in the middle of your regression, holdin’ its ground like it owns the place. You wanna know if it’s got the strength to stand up against the void, against the vast unknown. First, you gotta set up your hypotheses—like drawin’ lines in the sand. One says this coefficient ain’t nothin’ special, just another number in the mix. The other says it’s got somethin’ to say, somethin’ that cuts through the noise like a bullet through glass. Then comes the test, like a coin toss in the dark. You calculate your test statistic, measure the distance between what you see and what you expect. It’s like holdin’ your breath, waitin’ for that coin to land, revealin’ your fate in a heartbeat. Next, you face the abyss—calculate your p-value. It’s the measure of your defiance, your willingness to stare down chance and dare it to blink. A low p-value, and you got yourself a standoff. A high one, and it’s like the coin’s already fallen, the outcome clear as day. But remember, this test ain’t no friend of certainty. It’s a glimpse into the shadows, a dance with the unknown. Probability ain’t kind, and neither is regression. It’s just you, the numbers, and the stark reality starin’ you down like himself, determinin’ your fate with a coin toss in the dark.

We are not fond of dialect, but this is pretty spot on, y’all.

Review terms and concepts:

• multiple regression
• confounder
• omitted variables
• “controlling for…”
• conditional expected value
• expected value
• conditional expectation
• standardized coefficient
• adjusted R-square
• specification
• perfect multicollinearity
• dummy variable trap