5 Binary variables and randomized controlled trials (RCT)

Abstract

This chapter reviews the econometrics of binary variables, both as outcome variables and as explanatory variables, and then discusses basic econometrics of randomized controlled trials (RCT).

Keywords: Randomized controlled trials, Attrition, Compliers, General equilibrium spillovers, Dummy variables

5.1 Introduction

An important application of regression analysis is for estimation of the effects of a program or policy in the context of a randomized controlled trial (RCT, also written as randomized control trial). Social scientists increasingly use RCT to estimate the impact of a program or policy intended to affect human behavior. RCT are often viewed as much better for estimating impact than using observational data.

In an RCT, the researcher, usually working in cooperation with an organization or government, selects at random, from the population of intended recipients or beneficiaries of the program or policy, some persons or entities to receive the program and others not to receive it. Those selected to receive the program are the treatment group, and those selected not to receive the program are the control group.

The entities exposed to the treatment might be individuals, households, companies, villages, districts, or states. For example, some individuals might be given vouchers that may be used to pay tuition at private schools. The control group does not receive the vouchers. Some villages may be selected to hold intensive training for residents to effectively engage in local politics. The control group consists of villages that do not get the training.

Key to estimating the effects of a treatment is understanding the basic econometrics of variables that only take on two values, 0 and 1. As we have seen, such a variable is called a binary variable or a dummy variable. We first consider the question of estimating a single variable regression when the outcome variable is a binary variable. This is sometimes called the linear probability model. Then we consider the case where a binary variable is the explanatory variable (the regressor). We then run this kind of regression in R using the Kenya DHS 2022 data, and interpret the resulting coefficient. Following that, we reinterpret the dummy variable as a treatment indicator variable in the context of randomized controlled trials. We then replicate, in R, an important RCT using the actual study data. The nice thing about the RCT study, for our purpose, is that both the outcome variable and the treatment indicator are binary variables.

5.2 A binary or dummy outcome variable

Quite often in regression analysis the outcome variable \(Y\) is a dummy variable that takes on two values, 0 and 1. For example, the variable might be called \(graduated\_college\) and equal 1 for observations where the person graduated from a college or university, and 0 where the person did not. Suppose the explanatory variable were the income of the person’s parents. The model then might be: \[graduated\_college_i = \beta_0 + \beta_1parent\_income_i + \epsilon_i, i=1,...,N\]

When the outcome is a dummy variable that can only take on values 0 or 1, how should we interpret the estimated coefficients? What is the predicted value of the outcome for a given level of the explanatory variable?

5.2.1 The linear probability model

The model is often called the linear probability model (or LP model), and as the name suggests, we interpret the coefficient \(\beta_1\) as indicating how much an additional dollar (or an additional thousand dollars) of parental income increases the probability or likelihood of graduating from university. The conditional expectation of the outcome, for different levels of income, is the probability of graduating from university, conditional on a given level of parental income.

To see why the LP model has this interpretation, let's define the outcome variable as \(Y\), where \(Y=1\) if the event happens and \(Y=0\) if it does not. For example, \(Y_i = 1\) if person i graduated from college, \(Y_i = 0\) if not. When the dependent variable \(Y\) is binary, its expected value or population mean is the probability of the event \(P(Y = 1)\):

\[E[Y] = P(Y=0) \cdot 0 + P(Y=1) \cdot 1 = P(Y=1)\]

Then the fitted or predicted value of \(Y_i\) for observation i, namely \(\hat{Y_i}\), is the predicted probability of the \(Y_i = 1\) outcome, given the value of the regressor \(X_i\):

\[ \hat{Y_i} = \hat{\beta_0} + \hat{\beta_1}X_i = \hat{P}(Y_i = 1)\]

The estimated slope coefficient is the effect of a change in \(X\) on the predicted probability of the event \(Y=1\):

\[ \hat{\beta_1} = \frac{\Delta \hat{P}(Y_i = 1)}{\Delta X_i}\]

The model is widely used, despite an obvious problem: how do we interpret situations where the predicted value of the outcome is negative or greater than one? Nothing in the LP model prevents this from happening for some values of \(X\), for any given sample and estimation result. Predictions outside of the range between 0 and 1 do not make sense as probabilities, which must always lie in the range [0, 1]. Social scientists' common perspective on this problem is that in the vast majority of cases of applied econometrics it is a theoretical problem rather than a real problem. The sense is that the order of magnitude of the predicted probabilities of the outcome, for reasonable values of the explanatory variable, or of the effects on the probability of the outcome from reasonable changes in the explanatory variable, will not likely be affected by a different specification. In cases where the specification matters, there are other nonlinear regression specifications (called logit and probit) that constrain the predicted outcome to be between 0 and 1 and thus may be more appropriate. It has become conventional wisdom, however, that the predicted change in probability estimated by the linear probability model is very similar to that estimated using logit or probit, and econometricians often use the LP model "without apology."
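To illustrate this point, here is a minimal sketch (using simulated data rather than the Kenya DHS data; the variable names are made up for the illustration) comparing the LP model slope with the average marginal effect implied by a logit model.

# Simulate a binary outcome and compare LP model and logit estimates
set.seed(42)
n <- 5000
x <- rnorm(n, mean = 12, sd = 3)                 # e.g., years of education
p <- plogis(-4 + 0.3 * x)                        # true probability of the event
y <- rbinom(n, size = 1, prob = p)               # binary outcome
lpm <- lm(y ~ x)                                 # linear probability model
logit <- glm(y ~ x, family = binomial)           # logit model
coef(lpm)["x"]                                   # LP model slope
mean(dlogis(predict(logit)) * coef(logit)["x"])  # logit average marginal effect

In samples like this one, the two numbers are typically very close, which is the sense in which the LP model is used "without apology."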

5.2.2 Estimation with a dummy variable outcome in R

We can easily see how to estimate a linear probability model in R using the Kenya DHS 2022 data. The first thing to do is to create a dummy variable called high_earner that takes the value 1 when the person earns more than 150 USD. Everyone earning 150 USD or less is then coded as 0. Here is the relevant code. The table() command below tells us that, among individuals earning 1,000 USD or less, just under half are high earners by this definition.

kenya$high_earner = as.numeric(kenya$earnings_usd>150)
table(kenya$high_earner[kenya$earnings_usd<=1000])
## 
##     0     1 
## 10738 10091
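To see the shares rather than the counts, we could also ask for proportions (a small addition to the code above):

# Proportions of low and high earners among those earning 1,000 USD or less
prop.table(table(kenya$high_earner[kenya$earnings_usd <= 1000]))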

We then run a regression with the dummy variable as the outcome, and the education variable as the explanatory variable. The code for running the regression and displaying the results using modelsummary is here.

reg1 <- lm(high_earner ~ educ_yrs,
data=subset(kenya,earnings_usd<=1000))
modelsummary(reg1,
fmt = fmt_decimal(digits = 3, pdigits = 3),
stars=T,
vcov = 'robust',
gof_omit = "IC|Adj|F|Log|std",
title = 'Does more education raise the probability of being a high earner?')
Table 5.1: Does more education raise the probability of being a high earner?
 (1)
(Intercept) 0.107***
(0.009)
educ_yrs 0.040***
(0.001)
Num.Obs. 20829
R2 0.098
RMSE 0.47
Std.Errors HC3
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

The results are displayed in Table 5.1. The coefficient on years of education is about .04, indicating that each additional year of education raises the probability that a person would be a high income earner by .04. We might also say that an additional year of education raises the probability by 4 percentage points.
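We can also use the fitted model to calculate predicted probabilities at particular education levels. Here is a short sketch, assuming the reg1 object estimated above is still in memory.

# Predicted probability of being a high earner at 0, 8, and 16 years of education
predict(reg1, newdata = data.frame(educ_yrs = c(0, 8, 16)))
# with the estimates in Table 5.1, these are roughly 0.11, 0.43, and 0.75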

5.3 A binary or dummy explanatory variable

Sometimes the explanatory variable \(X\) in a regression is a binary or dummy variable that takes on two values, 0 and 1. For example, the variable might be called female and equal 1 for observations where the person gender-identifies as female and 0 otherwise. Or the variable might be north and take on value 1 if the observation is geographically in a northern region, and 0 otherwise. These kinds of variables are called binary variables or "dummy" variables. Dummy variables represent categories, and by convention the variable is named after the category coded as 1 (so do not call a variable gender or region, since then the reader does not know which gender or region is the 1 category and which is the 0 category).

Interpreting the coefficients on dummy explanatory variables in regression analysis is straightforward: the coefficient represents the expected difference in outcome for the 1 category compared with the 0 category. It is common to represent dummy variables as \(D\). So if we estimated \(Y = β_0 + β_1D = 20 + 2.5D\), then we interpret the coefficient of 2.5 as indicating that the 1 group has an average outcome for \(Y\) that is 2.5 units higher than the 0 group.

5.3.1 Using algebra to help interpret the coefficient

Some algebra using the expectations operator helps with interpreting the coefficient on the dummy variable. We can think of the expected value of the outcome \(Y\) for given values of \(D\) as:

\[E(y_i|D_i = d_i) = E(\beta_0 + \beta_1D_i + \epsilon | D_i = d_i)\]

The dummy variable can only take on two values, 0 or 1, so the conditional expectation is:

\[E(y_i|D_i = 0) = E(\beta_0 + \beta_1D_i + \epsilon_i|D_i = 0)\]

and

\[E(y_i|D_i = 1) = E(\beta_0 + \beta_1D_i + \epsilon_i|D_i = 1)\] These two equations can be rewritten as,

\[E(y_i|D_i = 0) = E(\beta_0 + \beta_1 \cdot 0 + \epsilon_i|D_i = 0)\] and \[E(y_i|D_i = 1) = E(\beta_0 + \beta_1 \cdot 1 + \epsilon_i|D_i = 1)\]

Now, subtracting and recalling that \(E(\beta_1|D_i) = \beta_1\) (the true coefficients are not random variables) and that \(E(\epsilon_i|D_i) = 0\) by assumption, \[E(y_i|D_i = 1) - E(y_i|D_i = 0) = \beta_1\]

Box 5.1 contains more algebra, showing that the interpretation of the coefficient of the dummy variable in a single variable regression is indeed the same thing as the difference in the sample means.

Box 5.1: Proof that the coefficient on a dummy variable is the difference in means

It is a good exercise to review the algebra of the summation sign by proving that the estimated coefficient on a dummy variable is equal to the difference in means, in the single variable regression case.

We start with, \[\begin{equation}\label{eq:coeffbeta1} \widehat{\beta_1} = \frac{\sum x_iy_i - \frac{1}{N} \sum x_i \sum y_i}{\sum x_i^2 - \frac{1}{N} \left( \sum x_i \right) ^2 } \end{equation}\] We can order the observations according to whether the \(x_i\) are equal to 0 or 1. Suppose the first \(N_1\) of them are the 0 values and the remaining \(N-N_1\) are the 1's. We can then substitute in the 0's and 1's and rewrite the numerator of equation \(\ref{eq:coeffbeta1}\) as, \[\begin{equation} \widehat{\beta_1} = \frac{\sum_{i=1}^{N_1} 0 \cdot y_i + \sum_{i=N_1+1}^{N} 1 \cdot y_i - \frac{1}{N}\left( \sum_{i=1}^{N_1} 0 + \sum_{i=N_1+1}^{N} 1\right) \sum_{i=1}^{N} y_i}{\sum x_i^2 - \frac{1}{N} \left( \sum x_i \right) ^2 } \end{equation}\] Or, noting that the summation of zeros is just zero, that the summation of 1's from \(N_1+1\) to \(N\) is \(N-N_1\), and breaking the summation of the \(y_i\) into two parts, \[\begin{equation} \widehat{\beta_1} = \frac{ \sum_{i=N_1+1}^{N} y_i - \frac{1}{N} (N-N_1) \left(\sum_{i=1}^{N_1} y_i + \sum_{i=N_1+1}^{N} y_i\right) }{\sum x_i^2 - \frac{1}{N} \left( \sum x_i \right) ^2 } \end{equation}\] Turning to the denominator, note that \(\sum x_i^2 = \sum_{i=1}^{N_1} 0^2 + \sum_{i=N_1+1}^{N} 1^2 = N-N_1\), and that the squared term \(\left( \sum x_i \right)^2\) can be expanded using \((a+b)^2 = a^2 + b^2 + 2ab\), where the equivalent here of \(a\) is \(\sum_{i=1}^{N_1} 0\), which is equal to 0, so \(\left( \sum x_i \right)^2 = (N-N_1)^2\). We can therefore rewrite the estimator as, \[\begin{equation} \widehat{\beta_1} = \frac{ \sum_{i=N_1+1}^{N} y_i - \frac{1}{N}(N-N_1) \left(\sum_{i=1}^{N_1} y_i + \sum_{i=N_1+1}^{N} y_i\right)} {(N-N_1) - \frac{1}{N} (N-N_1)^2 } \end{equation}\] Divide every term by \((N-N_1)\), \[\begin{equation} \widehat{\beta_1} = \frac{ \frac{\sum_{i=N_1+1}^{N} y_i}{N-N_1} - \frac{1}{N} \left(\sum_{i=1}^{N_1} y_i + \sum_{i=N_1+1}^{N} y_i\right)} {1 - \frac{1}{N} (N-N_1) } \end{equation}\] Simplifying the denominator, \[\begin{equation} \widehat{\beta_1} = \frac{ \frac{\sum_{i=N_1+1}^{N} y_i}{N-N_1} - \frac{1}{N} \left(\sum_{i=1}^{N_1} y_i + \sum_{i=N_1+1}^{N} y_i\right)} {\frac{N_1}{N}} \end{equation}\] Rearranging, \[\begin{equation} \widehat{\beta_1} = \frac{N}{N_1(N-N_1)}\sum_{i=N_1+1}^{N} y_i - \frac{1}{N_1} \left(\sum_{i=1}^{N_1} y_i + \sum_{i=N_1+1}^{N} y_i\right) \end{equation}\] Which is then equal to: \[\begin{equation} \widehat{\beta_1} = \frac{N}{N_1(N-N_1)}\sum_{i=N_1+1}^{N} y_i - \frac{N-N_1}{N_1(N-N_1)} \sum_{i=N_1+1}^{N} y_i - \frac{1}{N_1} \sum_{i=1}^{N_1} y_i \end{equation}\] And then, \[\begin{equation} \widehat{\beta_1} = \frac{1}{N-N_1}\sum_{i=N_1+1}^{N} y_i - \frac{1}{N_1} \sum_{i=1}^{N_1} y_i \end{equation}\] Which is the difference in mean outcomes, \(\bar y_{x_i=1} - \bar y_{x_i=0}\).
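A quick numerical check of the result in Box 5.1, using a small simulated dataset (this sketch is not part of the original analysis):

# Verify that the OLS coefficient on a dummy equals the difference in group means
set.seed(7)
D <- rbinom(200, 1, 0.4)
y <- 20 + 2.5 * D + rnorm(200)
coef(lm(y ~ D))["D"]                 # estimated coefficient on the dummy
mean(y[D == 1]) - mean(y[D == 0])    # difference in group means: the same number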

5.4 Estimation with a dummy variable in R

We can use R to estimate earnings with a binary variable for education. The first thing to do is create a dummy variable.

kenya$high_education = as.numeric(kenya$educ_yrs>=10)

The explanatory variable is now a dummy variable, taking on value 1 if the education attainment level is 10 or above. We can run the single variable regression, and display results using modelsummary.

reg2 <- lm(earnings_usd ~ high_education,
data=subset(kenya,earnings_usd<=1000))
summary(reg2)
## 
## Call:
## lm(formula = earnings_usd ~ high_education, data = subset(kenya, 
##     earnings_usd <= 1000))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -286   -131    -53     78    822 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      154.52       1.99    77.6   <2e-16 ***
## high_education   131.01       2.81    46.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203 on 20827 degrees of freedom
## Multiple R-squared:  0.0943, Adjusted R-squared:  0.0943 
## F-statistic: 2.17e+03 on 1 and 20827 DF,  p-value: <2e-16
modelsummary(reg2,
fmt = fmt_decimal(digits = 3, pdigits = 3),
stars=T,
vcov = 'robust',
gof_omit = "IC|Adj|F|Log|std",
title = 'How much does having a high education affect earnings?')
Table 5.2: How much does having a high education affect earnings?
 (1)
(Intercept) 154.523***
(1.556)
high_education 131.010***
(2.810)
Num.Obs. 20829
R2 0.094
RMSE 202.95
Std.Errors HC3
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Table 5.2 shows the results. The estimated coefficient on the dummy variable is about 131. Having a high level of education increases monthly earnings by about 131 USD. Recall that the mean monthly earnings, in the data, is about 220 USD, so this is a very large increase. The \(R^2\) is about .09, so 9% of the variation in earnings is explained by whether or not the person attained a high level of education. The RMSE (equivalently, the SER) is about 203, so the standard error of the regression is about the same magnitude as the mean of the outcome variable (monthly earnings). Both the SER and \(R^2\) are telling us that a lot of the variation in the outcome is not explained by the regression model. This is to be expected, and is not in any way "bad." Earnings are highly idiosyncratic, depending on many different attributes of people that are unlikely to be measured in any dataset.

We can also calculate the goodness of fit measures for the regression with the following code.

kenya_subset <- kenya %>% filter(earnings_usd<=1000)
reg2 <- lm(earnings_usd ~ high_education,
data=kenya_subset)
b_hat <- coef(reg2)
earnings_usd_pred <- b_hat["(Intercept)"] +
b_hat["high_education"] * kenya_subset$high_education
# And sum of squares
ESS=sum((earnings_usd_pred-mean(kenya_subset$earnings_usd))^2)
TSS=sum((kenya_subset$earnings_usd-mean(kenya_subset$earnings_usd))^2)
SSR=sum((kenya_subset$earnings_usd-earnings_usd_pred)^2)
# Calculate R-square
# R-square option 1
ESS/TSS
## [1] 0.0943
# R-square option 2
1 - SSR/TSS
## [1] 0.0943
# SER or RMSE
sqrt((SSR/(nrow(kenya_subset)-2)))
## [1] 203

Finally, we can verify that the estimated coefficient is indeed equal to the average difference in earnings for individuals with high levels of education attainment and those without.

kenya %>% filter(earnings_usd<=1000) %>% group_by(high_education) %>%
summarise(mean(earnings_usd, na.rm=TRUE))
## # A tibble: 2 × 2
##   high_education `mean(earnings_usd, na.rm = TRUE)`
##            <dbl>                              <dbl>
## 1              0                               155.
## 2              1                               286.

In this code, we pipe the dataframe into the filter() and group_by() commands, and then use the summarise() command to calculate means for each group.
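To make the comparison explicit, we can take the difference of the two group means and compare it with the coefficient in Table 5.2. Here is one way to do that with the same dplyr pipeline (a small sketch).

# Difference in mean earnings between the two education groups
means <- kenya %>%
  filter(earnings_usd <= 1000) %>%
  group_by(high_education) %>%
  summarise(m = mean(earnings_usd, na.rm = TRUE))
diff(means$m)   # should match coef(reg2)["high_education"], about 131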

5.5 Econometrics of RCT

The basic idea of a randomized controlled trial (RCT) is to estimate the effect of a program or policy. The way to do this is to calculate the difference, in the outcome of interest, between the sample treatment group and the sample control group. That is, the average of the outcome is calculated for the treatment group, the average outcome is calculated for the control group, and the difference in the two averages is the sample average treatment effect (ATE).

The sample ATE can also be estimated in a regression framework, where the researcher has data on the outcome \(Y\) and an indicator or dummy variable for treatment assignment, \(Treat\), that takes on value one if the entity is in the treatment group and zero if the entity is in the control group. The following regression model is estimated:

\[Y_i = \beta_0 + \beta_1Treat_i + \epsilon_i\]

where \(\hat{\beta_1}\) is the estimated difference, on average, between the treatment group and the control group. This estimated coefficient is sometimes referred to as \(\hat{\beta}^{ATE}\), where ATE refers to the average treatment effect. This coefficient is the estimate of what we think of as the population, or true, average treatment effect, \(\beta^{ATE}\).

In an RCT, assignment to the two groups is random. This is a key assumption, and is equivalent in meaning to our assumption in OLS that \(E(\epsilon_i|X = x_i) = 0\). Recall that this assumption is that other factors that determine the outcome are not correlated with the explanatory variable. Since in this case the explanatory variable was determined randomly, by definition other variables are not ex ante correlated with the explanatory variable.
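A small simulation (not based on the chapter's data; all names are made up for the illustration) shows why randomization justifies the single variable regression: the unobserved determinants of the outcome are independent of the treatment indicator, so OLS recovers the ATE.

# Simulated RCT: random assignment makes treat uncorrelated with 'ability'
set.seed(123)
n <- 1000
treat <- rbinom(n, 1, 0.5)              # random assignment to treatment
ability <- rnorm(n)                     # unobserved factor, independent of treat by design
y <- 5 + 2 * treat + 3 * ability + rnorm(n)
coef(lm(y ~ treat))["treat"]            # close to the true ATE of 2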

When we discuss multiple regression, we will see that control variables can be added to the regression to increase overall explanatory power and hence likely reduce the standard errors of the estimated coefficient of interest. For now, though, we focus on the single variable regression case.

5.6 Examining an RCT in R

To reinforce understanding of the single variable regression case, we shall replicate the analysis of an interesting RCT published in 2019. Alsan, Garrick, and Graziani (2019) conducted an RCT in Oakland, California. They opened and operated medical clinics on 11 consecutive Saturdays, some with Black and some with non-Black male medical doctors. They then recruited Black patients to seek medical consultations at the clinics. The recruitment was done at barbershops and local flea markets that were predominantly frequented by Black men. Men eligible for the program were approached by field officers, and offered incentives to participate and attend one of the clinics. Recruitment and subsequent take-up was quite high; 1,374 persons were recruited, and 637 eventually participated. The study randomized which doctors patients would see for preventative care appointments. Some doctors were Black and others non-Black. Before the actual consultation, patients completed preliminary questions and were informed about the characteristics of the doctor they had been randomly chosen to see for the appointment. The patients selected which preventative care treatments they thought they would want to receive. These treatments included non-invasive measurements (blood pressure measurement, body mass index measurement), invasive measurements (cholesterol testing and diabetes testing from a small blood sample), and a flu vaccine. Then they had the consultations, and together with the doctor finalized the preventative care treatments.

The study measured outcomes before (pre-consultation) and after (post-consultation). The outcomes were indicators of whether the patient was likely to follow up with a more intensive medical intervention. Would they get a flu shot? Would they have the clinic check their blood pressure? Would they allow the clinic to draw blood to measure their cholesterol levels? The hypothesis of the authors was that many Black Oakland residents were distrustful of the formal medical system, and would likely refuse subsequent non-urgent medical interventions (such as preventative vaccinations and blood tests) unless they were more likely to trust the doctor. The main finding of the research was that the differences in take-up of preventative behaviors induced by consulting with a "racially concordant" doctor could, in the words of the authors, "reduce the black-white male gap in cardiovascular mortality by 19 percent."

Just for fun, we asked ChatGPT for a jazzed-up version of the study findings. Here is the study, in the style of Cormac McCarthy:

In the parched streets of Oakland, California, an inquiry took root, probing the arcane nexus where the hues of skin and the art of healing intertwined. The research sought to fathom the sway of physician diversity upon the longing for preventive care among African American men. Through an experiment forged in randomized paths, Black men found themselves steered toward either kindred or disparate male medical practitioners. The study, wrought in a bifurcated design, plumbed their choices ere and after consultation, dissecting their predilections for the armamentarium of preventive measures. Before the consultation, the subjects mirrored in their preferences for preventive care. Yet, as they stood face to face with their fated physicians, a profound shift unfurled. In the realm of racially matched healers, the men took to every offering of preventive service with a vigor, notably the more invasive interventions. These revelations augured that physicians of akin racial ilk could whittle the chasm in cardiovascular mortality between Black and white men by a significant 19 percent. Thus, amidst the dry whispers of empirical rigors, the findings rippled with the weight of redemption, a testament to the unspoken alchemy of skin-deep symphony in the art of medical counsel.

The data from the study is available in a replication package (Alsan, Marcella, Owen Garrick, and Grant Graziani, 2019). Anyone can download the data and the software code (written in the statistical computing software Stata, an alternative to R) and replicate their analysis. We have placed the main replication data file in an accessible form on a website, and the following code will read the data into R. The data is stored in Stata format, so we use the haven package and its command read_dta to read in the data.

url <- "https://github.com/mkevane/econ42/raw/main/oakland_analysis_final.dta"
oakland <- read_dta(url)
oakland <- oakland %>% select(tag, black_dr,pre_bmi, post_bmi,
pre_cho, post_cho, pre_flu, post_flu, pre_bp, post_bp,
pre_dia, post_dia)
oakland <- oakland %>% filter(tag==1)

In the third line we use the pipe syntax to select some of the variables that we shall use for the analysis. We select only the variables for the various interventions, whether the patient saw a Black doctor or not, and a variable, tag, indicating whether the record is the principal record. The data contains duplicate records for each patient to facilitate other analyses. We then use the filter() command to keep only the records for the 637 study participants.
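A quick sanity check, not shown in the original analysis, is to confirm the number of participants and the split between the treatment and control groups:

nrow(oakland)             # should be 637, the number of study participants
table(oakland$black_dr)   # how many were assigned to a non-Black (0) vs a Black (1) doctor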

The first thing to do is create a table of descriptive statistics. For that, we can use the datasummary_balance command in the package modelsummary. Since we know that the treatment variable is black_dr, with value 1 if the assigned doctor was Black, we can calculate means and standard deviations for the treatment and control groups separately.

datasummary_balance(~black_dr, data = oakland %>%
select(black_dr,pre_bmi, post_bmi,pre_cho, post_cho, pre_flu,
post_flu, pre_bp, post_bp, pre_dia, post_dia),
stars=TRUE, title = "Were there differences in
cholesterol and flu vaccine interventions?",
notes = "Note: Superscripts denote p-value according
to +=.1, *=.05, **=.01, ***=0.001")
Table 5.3: Were there differences in cholesterol and flu vaccine interventions?

           black_dr = 0        black_dr = 1
           Mean   Std. Dev.    Mean   Std. Dev.   Diff. in Means   Std. Error
pre_bmi    0.5    0.5          0.5    0.5         0.0              0.0
post_bmi   0.6    0.5          0.8    0.4         0.2***           0.0
pre_cho    0.3    0.5          0.4    0.5         0.0              0.0
post_cho   0.4    0.5          0.6    0.5         0.3***           0.0
pre_flu    0.3    0.5          0.4    0.5         0.0              0.0
post_flu   0.3    0.5          0.4    0.5         0.1**            0.0
pre_bp     0.6    0.5          0.6    0.5         0.0              0.0
post_bp    0.7    0.5          0.8    0.4         0.1**            0.0
pre_dia    0.4    0.5          0.4    0.5         0.1              0.0
post_dia   0.4    0.5          0.6    0.5         0.2***           0.0
Note: Superscripts denote p-value according to + = .1, * = .05, ** = .01, *** = .001

Notice that in the syntax for the datasummary_balance() command we use a pipe operator to select the variables we want to appear in the table. In this case the variables are the same as the variables we had earlier selected, except for the variable tag. Normally, a dataframe would contain many variables that would not be included in the table of descriptive statistics, and so this option to select variables for the table is very useful.

The sample means and standard deviations are in Table 5.3. The column for the difference in means shows clearly that for the post-consultation outcomes, take-up was substantially higher for patients who saw a racially concordant doctor. For cholesterol testing, the proportion indicating they wanted the treatment was higher by .3, or about 30 percentage points. For the flu vaccine, the proportion wanting the vaccine was higher by .1, or about 10 percentage points.

We turn then to regression analysis (which of course is virtually the same thing as the table of means, as we saw above). We examine each outcome measured after the treatment started (the post period). For each outcome, we run a single variable regression, with the explanatory variable being a dummy variable for whether the Black patient had a Black doctor for their consultation, or not.

reg1 <- lm(post_bmi~black_dr, data=oakland)
reg2 <- lm(post_cho~black_dr, data=oakland)
reg3 <- lm(post_flu~black_dr, data=oakland)
reg4 <- lm(post_bp~black_dr, data=oakland)
reg5 <- lm(post_dia~black_dr, data=oakland)
models <- list("BMI"=reg1,"Cholesterol"=reg2,"Flu shot"=reg3,
"Blood pressure"=reg4,"Diabetes"=reg5)
modelsummary(models,
fmt = fmt_decimal(digits = 2, pdigits = 2),
stars=T,
vcov = 'robust',
title= "Did seeing a black doctor influence
subsequent medical intervention choices?",
gof_omit = "IC|Adj|F|Log|std")
Table 5.4: Did seeing a black doctor influence subsequent medical intervention choices?
 BMI Cholesterol Flu shot Blood pressure Diabetes
(Intercept) 0.60*** 0.36*** 0.32*** 0.72*** 0.42***
(0.03) (0.03) (0.03) (0.03) (0.03)
black_dr 0.16*** 0.26*** 0.11** 0.11** 0.21***
(0.04) (0.04) (0.04) (0.03) (0.04)
Num.Obs. 637 637 637 637 637
R2 0.030 0.067 0.014 0.016 0.044
RMSE 0.46 0.48 0.48 0.42 0.49
Std.Errors HC3 HC3 HC3 HC3 HC3
+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Notice in the code that we assign (using the assignment operator <-) the results of each of the regression models to a different "object" (reg1, reg2, reg3, reg4, reg5). Then we put these results into a list (and call this list models). Then we use the modelsummary command to display the results that are in the list named models.

Table 5.4 reproduces the results of the regressions. The coefficients on whether the patient saw a Black doctor are all positive and statistically significant. For these five outcomes, seeing a doctor who likely was better able to establish trust would likely improve subsequent medical care. Since the regression models are linear probability models, we interpret the coefficients as the change in the probability of the patient agreeing to request the follow-up treatment or consultation or procedure. A patient seeing a Black doctor is 16 percentage points more likely to get the BMI measurement, 26 percentage points more likely to get a blood test for cholesterol, 11 percentage points more likely to get a flu vaccination, 11 percentage points more likely to get a blood pressure test, and 21 percentage points more likely to get a diabetes blood test.

The \(R^2\) for the regressions vary from about .01 to .07. While these may seem low, it is important to note that medical decisions taken by individuals likely have a lot of variation that is driven by particular characteristics of individuals (and so explained by variables that are unlikely to be measured in any dataset). So an \(R^2\) of .01 may be perfectly reasonable. In any case, the focus in regression analysis is not on the overall explanatory power of the regression, but on the hypothesis test about the particular coefficient on the explanatory variable of interest.

The results of the experiment suggest that greater attention should be paid to recruiting and giving incentives for Black doctors to serve Black patients, as the improvement in health outcomes would likely be larger than the cost associated with such programs. The legacy of discrimination in the United States has costs that endure into the present, and should be remedied on both efficiency and equity grounds.

5.7 Caveats to reliance on RCTs to estimate impacts

There are four common real-world aspects of RCTs that mean that an estimated effect of the treatment may be biased, or may not be the right estimate for the parameter that we actually want to measure. The first is that there may be non-compliance or low take-up of the treatment. The second is that the treatment might spill over to the control group. Depending on the nature of the program, some of the control group may be "treated." For instance, if the program involves providing information to the treatment group, and sharing of information is inexpensive, then many in the control group might learn the information, and so effectively be "treated." The third is non-random attrition from the sample. The fourth is non-random response bias.

We consider each in turn.

5.7.1 Non-compliance in the treatment group

Individuals selected to be part of the treatment group, in most realistic RCT, have the option to not comply with the treatment, or to only partially comply with the treatment. In the case of substantial non-compliance, the estimated effect of the treatment will usually be biased towards zero. Non-random non-compliance might, however, bias the estimated treatment effect in the opposite direction, if individuals (or the entities being treated) have knowledge suggesting that for them the treatment may have effects operating in the opposite direction of the intended effect. Such scenarios are not hard to construct, although they may not be very likely. Imagine a training program, where some individuals think that completing a training might render them vulnerable to social sanctions or, in the case of training for political engagement, to government retaliation. They might understand that the treatment would have negative effects on their well-being, and so they do not comply with the treatment. The estimated treatment effect might then be positive, since those who complied with the treatment were those who did not think sanction or retaliation likely, while those who would have had negative effects did not comply and so had zero effect. In sum, the effect of the treatment might be estimated to be somewhat positive, when it might well have been overall a zero effect (the negative effects counterbalanced by the positive effects). This is a somewhat artificial example, but political scientists running RCT that involve incentives to engage politically or to empower individuals in non-democratic political situations must be attentive to how non-compliance may be influenced by the personal circumstances of the treated "subjects."

When there is substantial non-compliance in the treatment group, the interpretation of the estimated coefficient in the treatment regression above is somewhat different. It is now called the intention-to-treat estimate or the intent-to-treat estimate (ITT). As the discussion above suggested, usually, but not necessarily, \[|\hat{\beta}^{ITT}| < |\beta^{ATE}|\] as there will likely be less difference between the treatment and control group, since many in the treatment group are not complying.

In situations where there is significant non-compliance, a researcher might be tempted to estimate a slightly different regression, replacing the variable Treat in the regression above with a dummy variable Actuallytreated, if the researcher can observe or measure whether the person or entity complied with the treatment. In some RCT, this may be relatively straightforward, while in other RCT, where the treatment group is "assigned" to carry out certain unobserved or unmonitored actions, this may not be possible. In the case where it is observed, the estimated effect would likely be biased if compliance were non-random. In that case, actually taking up the treatment, or complying with the treatment, may be correlated with some unobserved characteristic of the person. That is, the explanatory variable Actuallytreated is likely correlated with the error term. A common way to recover a consistent estimate of the treatment effect is to use the Treat variable as an "instrumental variable" for the Actuallytreated variable. This instrumental variables estimate of the treatment effect is known as the local average treatment effect, or LATE. The econometrics of the technique of instrumental variables is too complex for this book, and the interested reader is referred to Angrist and Pischke (2009) and Angrist and Pischke (2014).
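A toy simulation (illustrative only, not from any study discussed here) makes the attenuation of the ITT and the logic of the LATE concrete: only some of those assigned to treatment take it up, and a simple Wald ratio recovers the effect for compliers.

# Non-compliance: only 60% of the assigned group actually takes the treatment
set.seed(1)
n <- 10000
assigned <- rbinom(n, 1, 0.5)                # Treat: random assignment
complied <- assigned * rbinom(n, 1, 0.6)     # actual take-up among the assigned
y <- 1 + 2 * complied + rnorm(n)             # treatment effect of 2 for those actually treated
coef(lm(y ~ assigned))["assigned"]           # ITT estimate: roughly 0.6 * 2 = 1.2
cov(y, assigned) / cov(complied, assigned)   # Wald/IV ratio: roughly 2, the LATE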

The literature has been very creative about labeling the various types who can "ruin" an RCT: defiers, never-takers, spiters, fun-seekers. An important but somewhat neglected area of research is what to do with clearly identified members of these groups. If the researcher becomes aware of instances of these types, should their observations be included or excluded? If their self-reported measurements are nonsense, and extreme outliers, how should a researcher handle that? These are practical questions for which there are a variety of suggestions. The general principle, though, is to make sure such choices made by the data analyst are reported clearly in the analysis report, and that attention is paid to the robustness of estimates to different choices.

5.7.2 Spillover effects

Non-compliance can also happen among the control group, in the very specific sense that some in the control group may receive or acquire the treatment. If the program is to impart knowledge to the treated (about techniques to manage a business, for example), then this knowledge might quickly spill over to members of the control group. It is easy to see that the greater the spillover, the more the estimate of the treatment effect is biased towards zero, because there will be less and less difference between the treatment and control groups.
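A similar toy simulation (again purely illustrative) shows the attenuation from spillovers: if some control units are effectively treated, the estimated difference shrinks toward zero.

# Spillovers: 40% of the control group effectively receives the treatment
set.seed(2)
n <- 10000
treat <- rbinom(n, 1, 0.5)
spill <- (1 - treat) * rbinom(n, 1, 0.4)     # spillover to some control units
y <- 1 + 2 * pmax(treat, spill) + rnorm(n)   # anyone effectively treated gets the effect
coef(lm(y ~ treat))["treat"]                 # about 2 * (1 - 0.4) = 1.2, not 2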

5.7.3 Non-random attrition

Non-random attrition from the sample can lead to a biased estimate of the program impact, when attrition is correlated with the outcomes. Consider four possibilities. Suppose, for example, that the program is so successful that many in the treatment group have such spectacular outcomes that their lives have changed for the better so much that they are no longer available to complete questionnaires or be interviewed. That voucher to go to medical school led some to become busy neurosurgeons or emergency room trauma doctors. They can no longer take the time to participate in the study. In this event, the cases where the treatment was most effective, leading to extraordinary outcomes, are very likely to no longer be in the sample. Those in the treatment group who remain in the sample are cases where the treatment was not so effective. The estimate of \(\hat{β_1}\) will be lower in magnitude than the true treatment effect.
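A sketch of this first case, using a toy simulation (illustrative only): when the treated units with the best outcomes leave the sample, the estimated coefficient understates the true effect.

# Attrition of the most successful treated units biases the estimate downward
set.seed(3)
n <- 10000
treat <- rbinom(n, 1, 0.5)
y <- 1 + 2 * treat + rnorm(n)
stay <- !(treat == 1 & y > quantile(y[treat == 1], 0.8))  # top 20% of treated attrit
coef(lm(y ~ treat, subset = stay))["treat"]               # noticeably below the true effect of 2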

To take a second case, suppose that those in the control group with the worst outcomes drop out of the sample. This is a significant issue when studying the effects of interventions to improve the health of children who are likely to die from diarrheal diseases or malnutrition. Some children in the control group may die, and not be in the sample at the endline (when the study ends). The estimate of \(\hat{β_1}\) will be lower in magnitude than the true treatment effect. This tragic possibility raises questions, of course, about the ethics of the particular RCT. It may be the case that survival to the endline (not dying) is itself the outcome, and researchers are uncertain about the efficacy of a treatment, which may have bad side effects. In this case the death is recorded and is the outcome, so there is no bias in the estimated effect of the treatment on the probability of survival. But if, say, nutritional status at endline were the outcome, then those who had died would no longer be measurable, and the bias would have to be corrected with appropriate statistical methods.

The third and fourth cases are similar. In the third case, some in the treatment group with small effects of the treatment and hence low outcomes might feel disappointed, and refuse to participate in the endline survey. The treatment will look effective because only those with good outcomes remain in the treatment group. The estimate of \(\hat{β_1}\) will be greater in magnitude than the true treatment effect. In the fourth case, those in the control group who have good outcomes drop out of the sample, leaving only those with bad outcomes.

Of course, these four cases do not exhaust the possibilities in the real world. There may be mixes of attrition, with some persons attritting from the control group and others attritting from the treatment group. No prediction is possible, then, without strong assumptions about the intensity of the selection into attrition on either side. Social scientists have developed techniques to bound the degree to which differential non-random attrition could bias estimates. For example, one could assume that all those attritting from the treatment group would have had the worst outcome among the treated, while all those attritting from the control group would have had the best outcome of the control group. If the estimated effect is then still positive, then it is likely to be a lower bound of the true effect.
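One way such a bound might be implemented in practice is sketched below, assuming a hypothetical dataframe df with the outcome y (missing for attriters), a treatment indicator treat, and an indicator attrited for units that left the sample; these names are made up for the illustration.

# Worst-case imputation: treated attriters get the worst observed treated outcome,
# control attriters get the best observed control outcome
df_bound <- df
df_bound$y[df$attrited & df$treat == 1] <- min(df$y[df$treat == 1 & !df$attrited])
df_bound$y[df$attrited & df$treat == 0] <- max(df$y[df$treat == 0 & !df$attrited])
coef(lm(y ~ treat, data = df_bound))["treat"]  # if still positive, a lower bound on the effect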

5.7.4 Response bias

Many of the outcomes measured in a social science RCT will be self-reported, and it is always possible that individuals will bias their responses depending on whether they were selected for the treatment or the control group. RCT in the social sciences are thus particularly vulnerable to response bias because individuals or entities may know that they are assigned to the treatment group (and they often know that they are assigned to the control group, having heard about the treatment). This revelation of information happens because there is usually no possibility of a placebo. Researchers must be attentive to this possibility of response bias, and endeavor to include outcomes that are measured by third parties or measures that are easily verified (that is, more "objective" measures).

5.8 Concluding thoughts

The practice of econometrics is about trading off model complexity against model parsimony, while being attentive to the credibility, confidence (in a sense informed by statistical analysis), ease of interpretation, and robustness of the estimated coefficients to alternative plausible specification choices. In subsequent chapters, the meaning of these terms will become clearer.

Review terms and concepts:

• dummy variable
• expectations operator
• randomized controlled trial (RCT)
• average treatment effect (ATE)
• non-compliance
• spillover effects
• non-random attrition
• response bias