10 Recapitulation of and practical tips for regression analysis

Abstract

This chapter recapitulates the basics of multiple regression analysis and discusses several practical issues in conducting regression analysis. We also present several examples of regression analysis.

Keywords: Hypothesis test, confidence interval, bias, least-squares

10.1 Recapitulating the multiple regression model

The multiple or multivariable regression model, as discussed in previous chapters, is:

\[Y = β_0 + β_1X_1 + β_2X_2 + · · · + β_kX_k + ϵ\]

where the \(β_k\) are the parameters describing how a change in the associated explanatory variable \(X_k\) affects the outcome. Using a random sample from a population of interest, the method of ordinary least squares generates an estimate of each coefficient, \(\hat{\beta}_k\), along with an estimate of its standard error, \(SE(\hat{\beta}_k)\). The standard error of each coefficient may be used to conduct hypothesis tests.

A dataset consists of observations of different variables for a number of entities, or cases. An entity of observation may be a person, household, firm, region, or country at a certain point in time. A dataset might thus consist of the education, age, and wage of a group of persons, or the GDP, infant mortality rate, and average education level of a set of countries.

A regression is a method for estimating the parameters \(β\) of a linear relationship \(Y = βX\) in which one or more exogenous or independent variables (X) explain or cause an endogenous or dependent variable (Y). The method is to find values for the parameters that minimize the sum of squared deviations of the predicted value of the dependent variable, \(\hat{Y}\), from the actual value of Y.
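The least-squares idea can be written out in a few lines of code. The following is a minimal sketch on simulated data (the variable names and numeric values are hypothetical, chosen only for illustration): it builds a design matrix with a constant and lets numpy's least-squares solver find the coefficients that minimize the sum of squared residuals.

```python
import numpy as np

# Simulated data: wage explained by education and age (hypothetical numbers).
rng = np.random.default_rng(0)
n = 500
education = rng.normal(12, 3, n)
age = rng.normal(40, 10, n)
wage = 5 + 1.5 * education + 0.2 * age + rng.normal(0, 4, n)

# Design matrix with a column of ones for the intercept beta_0.
X = np.column_stack([np.ones(n), education, age])

# OLS: choose beta to minimize the sum of squared deviations of Y-hat from Y.
beta_hat, rss, _, _ = np.linalg.lstsq(X, wage, rcond=None)
print("estimated coefficients:", beta_hat)   # close to [5, 1.5, 0.2]
print("sum of squared residuals:", rss)
```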

Some variables are not continuous, but rather take on values zero or one (0-1). These variables are called dummy variables. We could have a dummy variable as the outcome variable, and that is called the linear probability model.

The estimated parameters of a linear relationship are called coefficients. The output of a multiple regression calculation is a set of coefficients. The coefficients are interpreted as marginal effects: how much the Y variable might be expected to change with a one unit change in the X variable, holding all other variables constant. The coefficients estimated for dummy variables tell us how much the outcome changes when the dummy variable changes from zero to one. A common specification with a dummy variable as explanatory variable is in estimating the effects of a treatment in an RCT (randomized controlled trial).
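As a hedged illustration of that last point, here is a minimal sketch on simulated data, with hypothetical names and numbers, using the statsmodels library: the outcome is regressed on a 0-1 treatment dummy, and the dummy's coefficient estimates the average difference in the outcome between treated and untreated groups.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical RCT: a 0-1 treatment dummy and a continuous outcome.
rng = np.random.default_rng(1)
n = 1000
treated = rng.integers(0, 2, n)                      # dummy variable: 0 or 1
outcome = 10 + 2.0 * treated + rng.normal(0, 5, n)   # true treatment effect = 2

X = sm.add_constant(treated.astype(float))           # intercept plus the dummy
fit = sm.OLS(outcome, X).fit()

# The coefficient on the dummy is the estimated average difference in the
# outcome between the treated (=1) and untreated (=0) groups.
print(fit.params)   # roughly [10, 2]
```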

10.2 Inference and statistical significance

Most datasets are either samples of a population or interpreted as samples of a population (the set of countries that exist in the world is interpreted as a sample of the countries that have or might have existed). The variables included in a dataset are generally measured with error. The econometrician understands that other factors are also likely responsible for explaining the outcome. For these reasons, we may think of the regression analysis as an exercise in inference. We have imperfect and limited observations on a sample of the population, and we would like to infer something about the population from our sample of imperfectly observed and limited number of variables. We do not think of our coefficients as being the exact true values, but rather as being possibly near or far from the true values.

What determines how much confidence to have in our estimates? We use probability theory in order to make reasonable assumptions about the distribution of the estimators that we use to calculate coefficients. We calculate a test statistic (or t-stat, for short) associated with each coefficient that is estimated in the regression. The Central Limit Theorem is invoked to show that the t-stat follows, in large samples, the standard normal distribution. We then determine the probability that a sample would generate a t-stat at least as far from zero as the t-stat that we calculate (that is, we typically begin with a null hypothesis that the coefficient is equal to zero and so the X variable has no effect on the outcome). The convention in the social sciences is to reject the null when this probability, the p-value, is less than 5%. That is, if the t-stat is greater than 1.96 in absolute value (or the p-value less than .05), the null hypothesis is rejected, because if the true parameter were zero, it is very unlikely that the coefficient we found using the sample that we have would have been so large relative to its estimated standard error. If we reject the null hypothesis, we say that the estimated coefficient is “statistically significant.”
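The mechanics of the test can be written out directly. The sketch below is a hedged illustration with made-up numbers for the coefficient and standard error; it uses the large-sample standard normal approximation described above.

```python
from scipy.stats import norm

# Hypothetical regression output: an estimated coefficient and its standard error.
beta_hat = 0.35
se_beta = 0.14

# t-stat under the null hypothesis that the true coefficient is zero.
t_stat = (beta_hat - 0.0) / se_beta

# Two-sided p-value from the standard normal (large-sample) approximation:
# the probability of a t-stat at least this far from zero if the null were true.
p_value = 2 * (1 - norm.cdf(abs(t_stat)))

print(f"t-stat = {t_stat:.2f}, p-value = {p_value:.3f}")
# Reject the null at the 5% level when |t| > 1.96 (equivalently, p < .05).
print("reject null at 5% level:", abs(t_stat) > 1.96)
```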

The t-stat depends on the magnitude of the estimated coefficient and its estimated standard error. The standard error in turn depends on the sample size, the overall explanatory power of the regression, the degree of variation of the exogenous variable whose coefficient the regression is estimating, and the correlation of the explanatory variable with the other explanatory variables. The larger our sample (number of cases), the more confidence we may have. The better the overall fit of the regression, the more confidence we have. The more variation there is in our explanatory variables, the more confidence we may have; but if a variable is closely correlated with other explanatory variables, then we have less confidence in the estimated coefficient (i.e., the problem of multicollinearity).
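These four determinants can be read off one standard textbook expression for the sampling variance of a coefficient under homoskedasticity (the notation here is assumed for illustration):

\[
\mathrm{Var}(\hat{\beta}_k) = \frac{\sigma^2}{SST_k\,(1 - R_k^2)}, \qquad SST_k = \sum_{i=1}^{n} (X_{ki} - \bar{X}_k)^2
\]

where \(\sigma^2\) is the variance of the error term (estimated to be smaller when the regression fits the data better), \(SST_k\) grows with the sample size and with the variation in \(X_k\), and \(R_k^2\) is the \(R^2\) from regressing \(X_k\) on the other explanatory variables, which approaches one under multicollinearity and inflates the variance.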

We note that an important goal of a regression analysis is to determine the magnitude of an effect with some degree of precision. That is, the goal is to construct a confidence interval or test a hypothesis about the magnitude. Different assumptions and understandings lead to different methods for calculating standard errors. An assumption discussed earlier is whether to characterize the error in the regression model, denoted by \(ϵ_i\), as homoskedastic or as heteroskedastic. Two different formulas for calculating standard errors follow from these assumptions. A key phrase to remember in hypothesis testing is: “the probability of being as far away as, or farther away than,…”

There are other approaches to calculating standard errors and thus determining statistical significance. For example, errors may be clustered according to certain kinds of groups. Children in a classroom share the same teacher, so the unobserved influences of their teacher will be correlated across those observations; the errors are therefore correlated, and the calculation of standard errors for regression coefficients should be modified accordingly. In other situations, observations may be clustered geographically, so that unobserved factors relevant in explaining the outcome may be correlated. There are more complex formulas for calculating these clustered standard errors. Time series analysis has its own quite distinct approaches to calculating standard errors, because of the correlation of observations over time.
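The sketch below is a hedged illustration of these alternatives in statsmodels, on simulated classroom data with hypothetical names and numbers: the same point estimates are reported with classical, heteroskedasticity-robust, and clustered standard errors.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: test scores, hours studied, and classroom identifiers.
rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame({
    "classroom": rng.integers(0, 30, n),   # 30 classrooms
    "hours": rng.normal(5, 2, n),
})
# A classroom-level shock makes errors correlated within a classroom.
classroom_effect = rng.normal(0, 3, 30)[df["classroom"].to_numpy()]
df["score"] = 50 + 2 * df["hours"] + classroom_effect + rng.normal(0, 5, n)

formula = "score ~ hours"
fit_classical = smf.ols(formula, data=df).fit()                  # homoskedastic formula
fit_robust = smf.ols(formula, data=df).fit(cov_type="HC1")       # heteroskedasticity-robust
fit_cluster = smf.ols(formula, data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["classroom"]})    # clustered by classroom

# Point estimates are identical; only the standard errors differ.
print(fit_classical.bse["hours"], fit_robust.bse["hours"], fit_cluster.bse["hours"])
```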

It is worth remembering that statistical significance (rejecting a null hypothesis that the coefficient is equal to zero) is different from social or economic significance. One might imagine a set of studies, for example, estimating the effect of gun control laws on crime. The studies might estimate a negative effect that is statistically significant. But the estimated effect might be quite small, where a large tightening of gun control laws, very controversial and costly to enforce, only reduced armed assaults by one percentage point in a year. In that situation, informed by social science, citizens might agree that the effect was not large enough to justify a change in policy. The statistical significance of the finding is irrelevant if the social or economic significance is viewed as trivial in magnitude, especially relative to the costs of change.

Our preference is to encourage regression analysts to avoid phrases like “Our hypothesis proved to be valid” in discussing regression coefficients that are statistically significant. Similarly, avoid declaring, “Because the p-value is low, this proves that there is a relationship between X and Y.” The word “prove” is not a good word to use when interpreting the output of regression analysis. We always prefer the more modest phrase “suggests there is an association…”

10.3 Interpreting coefficients

The key basic phrase to remember when interpreting a regression coefficient for a continuous X variable is this: a one unit change in \(X_k\) is associated with a \(β_k\) change in the outcome. For a dummy variable the key phrase is this: the coefficient \(β_k\) is the expected average difference between the category equal to one and the omitted, or zero, category. For more complex regression specifications, such as polynomial specifications, log specifications, specifications with binary outcomes, or specifications with interaction terms, interpretation of coefficients is more complex. Box 10.1 contains some examples and discussion of various cases.

The scale of a regressor can always be changed in order to facilitate understanding of the magnitude of a coefficient. For example, if an explanatory variable in a cross-country regression is population, the effect of a one unit change in the population of a country (adding one more person) is likely to be tiny, so a coefficient might be 0.000000573, and when rounded to three decimals will be 0.000. This coefficient might be statistically significant (the hypothesis of zero coefficient can be rejected) and yet the regression output table might display 0.000***, which makes the table hard to interpret. Dividing the population variable by one million, so that a one unit change is interpreted as a change in the population of one million, would then generate a coefficient of 0.573*** and this is now more relevant and straightforward to interpret.
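Here is a minimal sketch of the rescaling point, on simulated data with hypothetical numbers: dividing the population regressor by one million multiplies its coefficient by one million while leaving the t-stat and fit unchanged.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical cross-country data: outcome depends (weakly) on raw population.
rng = np.random.default_rng(3)
n = 150
population = rng.uniform(1e5, 1.5e8, n)
outcome = 2.0 + 0.000000573 * population + rng.normal(0, 10, n)

X_raw = sm.add_constant(population)
X_millions = sm.add_constant(population / 1e6)   # one unit = one million people

fit_raw = sm.OLS(outcome, X_raw).fit()
fit_millions = sm.OLS(outcome, X_millions).fit()

print(fit_raw.params[1])        # about 0.000000573 (prints as 0.000 in a rounded table)
print(fit_millions.params[1])   # about 0.573, easier to read and interpret
print(fit_raw.tvalues[1], fit_millions.tvalues[1])   # identical t-stats
```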

A point to recall is that when discussing the relative magnitudes of coefficients it is sometimes useful to convert coefficients to “standardized coefficients.” These are interpreted as the number of standard deviations by which the outcome changes when the explanatory variable changes by one standard deviation. Discussing marginal effects in standard deviations can be more useful than marginal effects in units, especially when comparing the influence of different explanatory variables. For example, if southern counties with 500 more slaves per white person in 1850 had vote shares for the Republican presidential candidate increase by 2 percentage points in the various elections after 2000, while southern counties with 100 more kilometers of railroads in 1850 had vote shares for the Republican presidential candidate decrease by 1 percentage point, it is not obvious which variable was more “influential” in explaining the vote shares of contemporary elections. If, however, the coefficients were expressed as standard deviation changes in slave incidence and railroad kilometers, they are more easily interpreted.
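A hedged sketch of the conversion, on simulated data with hypothetical scales: a standardized coefficient is the unit coefficient multiplied by the standard deviation of the regressor and divided by the standard deviation of the outcome.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(0, 50, n)   # e.g., slaves per white person (hypothetical scale)
x2 = rng.normal(0, 20, n)   # e.g., kilometers of railroads (hypothetical scale)
y = 0.004 * x1 - 0.01 * x2 + rng.normal(0, 1, n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

# Standardized coefficient: unit coefficient times sd(X) / sd(Y).
std_coefs = fit.params[1:] * np.array([x1.std(), x2.std()]) / y.std()
print(std_coefs)   # comparable across regressors measured in different units
```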

When continuous variables are expressed as proportions or percentages, the wording or phrasing for interpreting a coefficient becomes somewhat less intuitive, and more care should be taken to get the wording right. Consider a regression model (notice we usually write variables in lower-case, because it is easier!):

\[prop\_trump = β_0 + β_1prop\_pop\_over65 + β_2prop\_evang + ϵ\]

where the units of observation, or the entities of observation, are the 3141 counties of the United States, and the variables are characteristics of the counties. Suppose that \(prop\_trump\) is the proportion of the vote in the 2020 Presidential election that went to Republican candidate Donald Trump. In the data this number might range from .15 to .75. If the proportion of vote for Trump was .15, this is the same thing as saying that 15% of the vote went to Trump. Percent, in customary usage, is just proportion times 100. The explanatory variables measure the proportion of the county population over 65 years old, \(prop\_pop\_over65\), and the proportion of county residents identifying as Evangelical Christians, \(prop\_evang\). By many accounts, older voters were more likely to vote for Trump, as were Evangelical voters, in the 2020 election. As researchers, we are interested in the magnitude of the relationships, and in testing the null hypothesis that the coefficients are equal to zero. We might worry that our model is too parsimonious; presumably there are many other characteristics of county residents that explain the vote share for Trump, and some of them are likely correlated with our two included variables, and so the estimated coefficients suffer from omitted variable bias. That is, there are other confounders out there, and we should try to measure them (or find datasets that measure them) and so include them in an expanded regression model. But for now, we are focusing on interpreting coefficients, so let us ignore OVB problems.

Suppose we estimate the regression model using data from the 2020 election, and obtain,

\[prop\_trump = .48 + .132prop\_pop\_over65 + .147prop\_evang + ϵ\]

The estimated coefficient on the variable \(prop\_pop\_over65\) is .132, and the estimated coefficient on the variable \(prop\_evang\) is .147. We could say that a one-unit change in \(prop\_pop\_over65\) is associated with a .132 change in \(prop\_trump\), and that would be correct, but that is not really what we mean by interpreting (rendering in plain English) the coefficient. In this case, we might say that an increase of .10 in the proportion of the county population that is over 65, that is, an increase of ten percentage points, is associated with an increase in the vote share for Trump of .0132, or about 1.32 percentage points. So, a 10 percentage point increase in the elderly population is associated with an increase in the Trump vote share of 1.32 percentage points. Notice how in English it is common to mention both the change in the proportion as well as the equivalent change in terms of percentages. Turning to the second coefficient, we might say that an increase in the proportion of evangelicals in a county of .10, or 10 percentage points, is associated with an increase in the share of residents voting for Trump of .0147, or 1.47 percentage points.

Consider now a similar regression with slightly different variables. Suppose the vote share for Trump is now represented in the dataset as the percent vote share for Trump, varying from 30 to 75, for example. A dummy variable is included in the regression for whether the county has at least one evangelical megachurch that can host more than 10,000 persons per week for church services. \[percent\_trump = 48 + 16.2prop\_pop\_over65 + 3.45megachurch + ϵ\] Now we interpret the coefficient on the dummy variable as indicating that, on average, counties with megachurches had a Trump vote share 3.45 percentage points higher than counties without megachurches. A .10 increase in the proportion of the county population that is over 65, that is, an increase of ten percentage points, is associated with an increase in the vote share for Trump of 1.62 percentage points.

One of the most difficult tasks in regression analysis is to correctly interpret a coefficient in words. This is especially true when outcome or explanatory variables are in proportions or percentages. Econometricians have to become comfortable and familiar with the English usage differences between a change of one percentage point and a percentage change of one percent. Consider when a variable expressed as a percent increases from 99% to 100%. This is a change of one percentage point, and using the formula that percentage change equals new minus old divided by old, multiplied by 100, this is a change of \[\frac{(100-99)}{99} * 100=1.010101 \text{ percent} = 1.010101\%\]

Alternatively, consider when a variable that is expressed as a percent increases from 2% to 3%. This is a change of one percentage point, and using the formula that percentage change equals new minus old divided by old, multiplied by 100, this is a change of

\[\frac{(3-2)}{2} * 100=50 \text{ percent} = 50\%\] So a percent change and a percentage point change can be very different. It is also important to remember that in common English usage a proportion, say .52, when multiplied by 100 is equivalent to 52%. When a variable is expressed as a proportion, and thus must be between 0 and 1, we are rarely interested in interpreting the coefficient as, “What is the effect on the outcome of the proportion increasing by one, that is, going all the way from 0.00 to 1.00?” Instead, a researcher might be interested in the effect of a change in the proportion of 0.10. “What is the effect on the outcome of a change in the explanatory variable, which is measured as a proportion, of 0.10?” Note that a change of 0.10 in a variable expressed as a proportion is equivalent to saying that the variable increases by ten percentage points.
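A tiny sketch of the distinction, using the two worked examples above (the helper function names are just for illustration): the percentage-point change is a simple difference, while the percent change divides by the starting value.

```python
def percentage_point_change(old, new):
    """Difference between two values that are measured in percent."""
    return new - old

def percent_change(old, new):
    """Relative change, (new - old) / old, expressed in percent."""
    return (new - old) / old * 100

print(percentage_point_change(99, 100), percent_change(99, 100))  # 1, about 1.01
print(percentage_point_change(2, 3), percent_change(2, 3))        # 1, 50.0
```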

Box 10.1: Interpreting coefficients in complex model specifications

Interpreting coefficients in more complex regression models depends on how the outcome Y and explanatory variables of interest X are measured. Outcome variables can be of three types: (1) Y1, continuous, (2) Y2, the natural log of a continuous variable, or (3) Y3, a dummy variable that equals 1 or 0. Similarly, explanatory variables can be of three types: (1) X1, continuous, (2) X2, the natural log of a continuous variable, and (3) X3, a dummy variable that equals 1 or 0. Various combinations of explanatory and outcome variables then lead to different wording for interpreting the coefficients. Consider several specifications:

Specification 1: \({Y_1} = \beta_0 + \beta_1{X_1} + \beta_2 ln{X_2} + \beta_3{X_3}\), where outcome is continuous variable \[\begin{align*} \beta_1 &=\frac{\partial Y_1}{\partial X_1} \Rightarrow \mbox{1 unit change in ${X_1}$ generates a $\beta_1$ unit change in $Y$}\\ \beta_2 &=\frac{\partial Y_1}{\partial ln{X_2}} \Rightarrow \mbox{1\% change in ${X_2}$ generates a $\beta_2/100$ unit change in $Y$}\\ \beta_3 &\Rightarrow \mbox{$Y$ for the 1 category in ${X_3}$ is $\beta_3$ higher than the 0 category} \end{align*}\]

Specification 2: \(lnY_2 = \beta_0 + \beta_1{X_1} + \beta_2{ln{X_2}} + \beta_3{X_3}\), where outcome is logged continuous variable \[\begin{align*} \beta_1 &=\frac{\partial lnY_2}{\partial X_1} \Rightarrow \mbox{1 unit change in ${X_1}$ generates a $100*\beta_1$\% change in ${Y}$}\\ \beta_2 &=\frac{\partial lnY_2}{\partial ln{X_2}} \Rightarrow \mbox{1\% change in ${X_2}$ generates a $\beta_2$\% change in ${Y}$}\\ \beta_3 &\Rightarrow \mbox{$Y$ for the 1 category in ${X_3}$ is $100*\beta_3$\% higher than the 0 category} \end{align*}\]

Specification 3: \({Y_3} = \beta_0 + \beta_1{X_1} + \beta_2ln{X_2} + \beta_3{X_3}\), where outcome is dummy 0-1 variable \[\begin{align*} &\beta_1 =\frac{\partial Y_3}{\partial X_1} \Rightarrow \mbox{a 1 unit change in ${X_1}$ generates a $100*\beta_1$ percentage}\\ &\mbox{\hspace{1cm} point change in the probability ${Y_3}$ occurs}\\ &\beta_2 =\frac{\partial Y_3}{\partial ln{X_2}} \Rightarrow \mbox{a 100\% change in ${X_2}$ generates a $100*\beta_2$ percentage}\\ &\mbox{\hspace{1cm} point change in the probability ${Y_3}$ occurs}\\ &\beta_3 \Rightarrow \mbox{the probability that ${Y_3}$ occurs is}\\ &\mbox{\hspace{1cm} $100*\beta_3$ percentage points higher for the $X_3=1$ category} \end{align*}\]
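To close the box, here is a hedged sketch of Specification 2 on simulated data (all names and numbers are hypothetical); the printed interpretations restate the Box 10.1 wording, which relies on the usual small-coefficient approximations for logs.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 800
x1 = rng.normal(10, 2, n)        # continuous regressor
x2 = rng.uniform(1, 100, n)      # positive regressor, entered in logs
x3 = rng.integers(0, 2, n)       # dummy regressor
log_y = 1 + 0.05 * x1 + 0.30 * np.log(x2) + 0.10 * x3 + rng.normal(0, 0.2, n)

X = sm.add_constant(np.column_stack([x1, np.log(x2), x3]))
b0, b1, b2, b3 = sm.OLS(log_y, X).fit().params

# Interpretations for a logged outcome (approximate, as in Box 10.1):
print(f"1-unit change in X1 -> about {100 * b1:.1f}% change in Y")
print(f"1% change in X2     -> about {b2:.3f}% change in Y")
print(f"X3 = 1 vs X3 = 0    -> about {100 * b3:.1f}% higher Y")
```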

10.4 Goodness of fit

For calculating the goodness of fit of a regression, the statistics SER (the standard error of the regression) and \(\overline{R}^2\) (adjusted \(R^2\)) are used. Many students think a goal of regression analysis is to “achieve” a high \(R^2\). But a high \(R^2\) is a poor objective to pursue.

The goal of regression analysis is emphatically NOT to choose to include variables until one has the highest \(\overline{R}^2\). Firstly, many outcomes studied in the social sciences tend to be unpredictable since they are driven by intention rather than following a mechanical pattern. A very high fit suggests that one can predict the future behavior of humans with some accuracy, but this is unlikely to be the case. Secondly, various combinations of variables, or specifications, will produce different \(R^2\) values. When determining the “best” specification, social scientists rely on theoretical frameworks, whether implicit or explicit, that deduce which variables are relevant in explaining the outcome. A specification that includes a variable unrelated to the underlying theory may not be ideal, even if it results in a higher \(R^2\).

The very idea of \(\overline{R}^2\) is to indicate the importance of parsimony, by penalizing the econometrician for adding more and more variables. An econometric model, like any social science model, is a simplification of a complex social process intended to enable better understanding of the process. Adding more and more variables to a regression often makes it harder to understand the underlying causal process. There are no “rules” about what makes a good model, or the right number of variables to include in a regression analysis. Many econometricians favor presenting a variety of specifications in order to indicate that a main relationship of interest is robust to different sets of included variables.
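For reference, with \(n\) observations and \(k\) regressors plus an intercept, the adjusted \(R^2\) applies a degrees-of-freedom penalty to the ordinary \(R^2\):

\[
\overline{R}^2 = 1 - \frac{n-1}{n-k-1}\,\bigl(1 - R^2\bigr)
\]

so adding another variable raises \(\overline{R}^2\) only if it improves the fit by more than the penalty for the lost degree of freedom.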

Figure 1: A discussion of parsimony from the 1960s

Moreover, parsimony is also a standard virtue in the social sciences, so if a set of variables explains much of the explainable variation, and another set that includes tens more variables only improves the goodness of fit by a relatively small amount, the parsimonious model is often preferred. Finally, social scientists increasingly eschew having one single specification as the “best” or “preferred” specification, and are more comfortable with modestly describing a set of specifications as being equally plausible and generating reasonably similar results. If different specifications differ substantially in their estimated coefficients, especially for the main variables of interest for the problem at hand, social scientists increasingly are noting that perhaps the data, variables, and measurements available are not of sufficient quality to estimate the hypothesized relationship with a high degree of credibility.

Figure 2: Reproduction of regression results: Nunn (2008) article on legacy of slave trade

Figure 2 reproduces a table of regression results from Nunn (2008), “The Long-Term Effects of Africa’s Slave Trades,” published in The Quarterly Journal of Economics. In the article, Nunn calculated a measure of how many slaves were “exported” (forcibly) from African countries, divided by the land area of the country. He used data from an important project by historians to track down and digitize ship manifests, recorded by customs agents in port cities in the Americas during the slave trade (which lasted from the 1600s until the mid-1800s). Of course, when the slaves were taken, there were no modern countries, so Nunn ascribes slaves to present-day countries according to their ethnic groups. Ethnicity was inferred from the names of slaves, or was sometimes recorded along with the names.

In the regression table, the explanatory variable of interest is the log of this measure of exports per unit of area, and the outcome variable is the log of real per capita GDP in 2000. The regression is intended to contribute to the discussion of how much the slave trade in the past may have accounted for lower income in African countries in the present. That question is a bit of a muddle, as are many big picture counterfactual questions in history, but perhaps the effort and methods are worth the haziness of the question?

How can we interpret the meaning of the coefficient estimate of -.112 in column 1? The table indicates that the explanatory variable for that coefficient is the log of slave exports. The outcome is in logs. The specification, then, is a log-log specification. So a 1% increase in exports of slaves is associated with a .112% decline in real per capita GDP (since both variables are in logs, the interpretation is the % change in Y associated with a % change in X). The estimate is statistically significant, but we might want to ask about the magnitude. A 50% increase in slave-raiding in a country during the slave trade appears to be associated with a decrease in per capita GDP of roughly 5%. If GDP was about $1,000 per person in the typical African country in 2000, then 5% is about $50 per person. If there are about one billion people in the countries affected, this is about $50 billion per year. That is a sizable amount.

The value 0.024 reported in column 1 underneath the coefficient is the standard error of the estimated coefficient. Recall that we think of the estimated coefficient as having a probability distribution since we think of it as a random variable; other samples would have generated other estimates of the coefficient. This sampling distribution has a standard deviation. This standard deviation, estimated using the data from the sample, is what the standard error reports. The formula in the simple single-regressor case was given earlier in the book. The coefficient estimate of -.112 in column 1 has three asterisks next to it. These asterisks represent the results of the hypothesis test that the true coefficient is equal to zero. We conduct the hypothesis test by constructing the t-statistic (or t-stat), the estimated coefficient divided by the standard error. By the Central Limit Theorem the t-stat follows the standard normal distribution, so we find the probability of observing an estimate at least as far away from the null (where β = 0) as the one we actually observe, assuming the null were true. If that probability were less than .01, we reject the null (and put three asterisks next to the coefficient). As the footnote to the table indicates, if that probability were less than .05, we reject the null at the 5% level and put two asterisks next to the coefficient.
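Working through the arithmetic with the numbers in column 1:

\[
t = \frac{-0.112}{0.024} \approx -4.7
\]

which exceeds, in absolute value, the critical value of roughly 2.58 for a two-sided test at the 1% level, so the null of a zero coefficient is rejected and three asterisks are attached.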

Let’s interpret the meaning of the coefficient estimated for French legal origin in column 5 (0.643). The regressor is a binary variable indicating if the country’s legal code has origins in the French legal code. The outcome is in logs, so taking the point estimate at face value, a country being of French legal origin is associated with roughly 64.3% greater per capita GDP than not being of French legal origin, other factors controlled for (the 100*β% reading is only an approximation for a coefficient this large; the exact figure would be exp(0.643) − 1, or about 90%). You might observe, however, that this coefficient is not statistically significantly different from zero.

The meaning of the coefficient estimated for ln(oil prod/pop), in column 5, is simple: A 1% increase in oil production per capita is associated with a .078% increase in real per capita GDP (since both are in logs).

The coefficient on ln(exports/area) is negative and significant in all 6 specifications, and does not change that much in magnitude. We might tentatively suppose that indeed the relationship between the magnitude of the slave trade and present-day income per capita is negative, though we might want to carefully examine the validity or credibility of this supposed causal relationship.

Review terms and concepts: • multiple regression model • inference • statistical significance • interpreting coefficients • proportions, percents, and percentages • goodness of fit and parsimony • log-log specification