2 Basic inferential statistics

Abstract

This chapter reviews a core inference question addressed in statistics: Is the difference between the means of some variable, calculated for two groups in a sample (e.g., men and women), indicative of a difference in the population? The ideas of a formal hypothesis test and of a confidence interval are introduced.

Keywords: Mean, sample variance, t-statistic, p-value, hypothesis test, confidence interval

2.1 Comparing the means of two groups

Social scientists are always comparing means of variables for two (or more) groups. Is this group different from that group? Just looking at the two means is not enough. If the samples from which the means were calculated were very small, then the difference in the means may have just been the result of chance. Perhaps the sample for one group drew a few high values and the sample for the other group drew a few low values. Moreover, the variance of the variable, for the observations in the samples for the two groups, may be very high. That would suggest that the population variance is also high. With a lot of variation in values, we would expect that small samples could have very different means for the two groups.

Statisticians have developed a reasonable process to follow to determine whether we should infer that population groups differ, on average, when sample means, drawn from those populations, are different. The term used is “statistical significance.” In order to infer whether the difference in sample means likely reflects a difference in the population means, social scientists calculate a test statistic (a t-statistic) and examine the probability of observing a t-statistic of the value actually observed, or larger, in absolute value.

The null hypothesis that there is no difference in the means between the two groups is rejected when the probability (p-value) of obtaining a t-statistic as extreme as the one actually obtained, if the null hypothesis were true, is very low. If the null hypothesis is rejected, the difference is said to be “statistically significant.”

In this section we review the probability and statistics tools needed to make inferences about differences in population means based on random samples for two groups drawn from the population. We also use R to read in the Kenya DHS 2022 data and use it to calculate differences in means and useful visualizations of comparisons.

2.1.1 Visualizing the distribution of values for two groups using the boxplot option in ggplot

A common visualization in data analysis is the box plot, or “box-and-whiskers” plot, which shows the median (the 50th percentile value when a variable is ranked in order), 25th percentile, and 75th percentile of the values for a variable. The 25th, 50th, and 75th percentiles divide the range of the variable’s values into four quartiles (the 1st, 2nd, 3rd, and 4th quartiles). The lines extending from the box are often called whiskers, and they extend from the hinge to the largest value no further than \(1.5 \cdot IQR\) from the hinge (where IQR is the inter-quartile range, the distance between the 25th and 75th percentiles). Dots mark the remaining observations, and they are sometimes referred to as “outliers.”

Box plots are most interesting when we compare the distribution of values for different subsets of the data, side by side. In this case, let us compare the distribution of monthly earnings between men and women in Kenya. Run the following command in R to generate two box plots. Remember you must use the library commands to load packages and use other commands to set various options, and read in the data as in the previous chapter.

ggplot(data = subset(kenya, earnings_usd <= 1000)) +
   aes(female, earnings_usd, fill = factor(female)) +
   geom_boxplot() +
   labs(x = "Gender", y = "Earnings")

Figure 2.1: Boxplots of monthly earnings in Kenya DHS 2022 sample, by gender (with sample restricted to those earning $1000 USD or less)

Note that the first variable in the aes() is the X-axis variable: the groups we want to compare (female in this case). The second variable is the Y-axis variable: the variable whose distribution we are examining (earnings_usd). The fill= option colors each box differently for each X-axis category. The geom_boxplot() option tells ggplot to make a box plot. To create two side-by-side boxplots, the grouping variable for the X-axis must be a factor variable with well-defined levels (and not a numeric or continuous variable). If the grouping variable is numeric, it will have to be turned into a factor variable. This can be done in the command itself, as we have done above.

Figure 2.1 is not so useful, because there are a number of outliers that make the Y-axis range from 0 to 1000, when, as we can see, most of the observations are actually below 300 USD. We can redo the boxplot, subsetting the sample to include only observations with monthly earnings below 300 USD. Before we do that, note that a new factor variable can be created from a numeric or a character variable. Code like this can be adapted to “factorize” a 0,1 variable:

kenya$female_f <- factor(kenya$female,
   levels = c(0, 1),
   labels = c("Male", "Female"))

Use this new factor variable, instead of factor(female), and re-run the boxplots, so that now the legend will show “Male” and “Female” rather than 0 and 1. Run the following code, and notice how we again use the subset command. We also add an option, stat_summary(), that will calculate the mean for each group and display it as well (it normally would be fairly close to the median). There are several options for how to display the mean. Try changing the shape, size, color, and fill, and see the results.

ggplot(data = subset(kenya, earnings_usd <= 300),
   aes(female, earnings_usd, fill = female_f)) +
   geom_boxplot() +
   stat_summary(fun = mean, geom = "point",
      shape = 17, size = 6, color = "red", fill = "red") +
   labs(x = "Gender", y = "Earnings") +
   theme_bw()

Figure 2.2: Boxplots of monthly earnings in Kenya DHS 2022 sample, by gender (with sample restricted to those earning less than $300 USD)

These side-by-side boxplots are reproduced in Figure 2.2. It is clearer now that the median, 25th percentile, and 75th percentile of earnings are higher for men than for women in the Kenya sample. Indeed, it looks like the 25th percentile level of earnings for men is close to the median level of earnings for women. The mean earnings, represented by the red triangles, are almost the same as the median earnings for men, but for women the mean is somewhat higher than the median.

2.1.2 Calculating the difference in means

In R, the tidyverse package with piping is useful for the purpose of comparing means:

kenya %>% filter(earnings_usd<=1000) %>% group_by(female) %>%
   summarize(earnings_usd=mean(earnings_usd,na.rm=TRUE))
## # A tibble: 2 × 2
##   female earnings_usd
##    <int>        <dbl>
## 1      0         269.
## 2      1         183.

Notice the syntax. The line of code starts with the kenya dataframe and pipes it into the filter() command (to keep only observations with earnings of 1000 USD or less) and then into the group_by() command, which, as the name implies, groups the data by a categorical variable. Then the summarize() command calculates the mean earnings for each group, ignoring missing values as per the option na.rm=TRUE. Note the output in the console. A “tibble” is a special kind of object in R, similar to a dataframe. Indeed, we could assign the results to an object in our environment, rather than have the results printed out in the console. The following lines of code will do that. Note that we have used the assignment operator <-.

t_mean_earning <- kenya %>% filter(earnings_usd<=1000) %>%
   group_by(female) %>%
   summarize(earnings_usd=mean(earnings_usd,na.rm=TRUE))

After running the lines of code, verify that you see the object t_mean_earning in the environment.

We see that the sample means for the two groups are indeed different (about 269 USD for men and 183 USD for women). Should we infer that the population means are also likely to be different, and that the difference is likely to be of that order of magnitude? What if our sample size were very small? What if the variation in earnings is very large? We might not want to make an inference about the difference in the population group means if there was a lot of variation in a small sample.

2.2 Background to inference: Random variables

Now that we have learned how to calculate and visualize comparisons of a variable across different groups in a sample, we return to the question of what we can learn about the broader population from this kind of information.

We make an inference when we attribute to a population something that we observe from entities that come from that population. There are many bad ways to make an inference. For example, the laziest way, known as inferring from anecdote, is to take a sample of one observation and assume that the population is probably just like the one observation. “Kids these days don’t know how to work hard,” is a common such anecdotal inference.

Introductory statistics classes focus on the question: What is a reasonable way to infer what the population mean of a variable is, using observations from a sample drawn from the population?

The standard answer to the “how to reasonably infer” question is fairly straightforward. Draw a reasonably-sized random sample from the population, and then measure the variable of interest reasonably well. Then calculate the sample mean. Since the sample mean depends on the sample that was drawn, it can be thought of as a random variable. A different sample would have led to the calculation of a different sample mean. The probability distribution of the sample mean can be deduced using mathematical reasoning and basic probability theory. Having a good idea of the distribution of the sample mean enables an analyst to make some judgments about what they can infer about the population from their sample. Indeed, one can calculate a confidence interval or conduct a hypothesis test. Alternatively, in a branch of statistics known as Bayesian statistics, a prior estimate of the population mean can be suitably updated with information about the sample mean.

Our focus here is on using statistics to make comparisons between groups in a population, so we will want to draw inferences about differences in means between groups, rather than just the mean of the overall population. To get there, we need to review some basic concepts about random variables and sampling distributions.

2.2.1 Random variables and probability distributions

The concept of a random variable is important in statistics. Consider a variable such as the age of individuals in a population. Clearly age is not the same for everyone. The observations of the variable age could take on many possible values in any given sample. The age of a person drawn at random from the population is not known in advance, and so may be thought of as random—hence, a random variable. By contrast, the age of Michael Kevane, on September 1, 2024, is not a random variable. Michael’s age is what it is. We refer to the specific outcome value taken by a random variable for a particular observation as a “realization” of that variable.

The number of heads that might come up in 3 coin tosses is likewise a random variable. The outcome could be (H,T,T), and so the number of heads would be 1, or it could be (T,H,H), and so the number of heads would be 2, and so forth. We do not know in advance what the number of heads will be. Once we have tossed the coin three times, the outcome is no longer a random variable: it is now data. If the outcome that actually happened had been (T,T,T) the “realized” number of heads would have been 0.

We think of each possible outcome of a random variable as being associated with a probability, which is a number representing how likely that realization is. If the random variable is the number of times heads comes up in three coin tosses, the possible outcomes are 0, 1, 2, or 3. What is the probability of each of these outcomes? Are they equally likely, or will some be more likely than others? If the coin is fair, then the probability of heads for each toss is 0.50, or one half, and assuming that the outcome of one coin toss does not influence the next, we can calculate the probabilities of each possible outcome. The probability of 3 heads out of 3 tosses is \(0.5^3\), or 0.125. Symmetrically, the probability of 0 heads and 3 tails is also 0.125. The probability of 1 head and 2 tails, in any order, is 0.375, and the probability of 2 heads and 1 tail, in any order, is also 0.375. The probabilities of the four possible outcomes, then, sum to 1.00. Indeed, for mutually exclusive outcomes, the probabilities of all possible outcomes have to sum to 1.
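These binomial probabilities can be checked in R with the built-in dbinom() function; here is a quick sketch:

dbinom(0:3, size = 3, prob = 0.5)   # probabilities of 0, 1, 2, and 3 heads
## [1] 0.125 0.375 0.375 0.125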

If we take time to think about these probabilities, we might find ourselves in the realm of philosophy. What do we really mean by a probability of an outcome of a random variable being a certain number? Some academics take a frequentist approach, and argue that the probability is the proportion of the times that an outcome would occur, if the outcome of the random variable were observed many, many (infinity) times. Other academics take a subjectivist approach, and argue that the probability is a mental construct, and does not correspond to any physical object in nature, or any “truth.” Supposedly, the philosopher Bertrand Russell once observed, “Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means.” Without taking a strong stand on such philosophical debates, we stick with a common sense understanding of probability: i.e., How likely is it to occur?

The set of probabilities of each possible outcome of a random variable constitutes a probability distribution. For the random variable of the number of times heads comes up in three coin tosses, the probability distribution is discrete. There are only four possible outcomes, and there is a probability associated with each outcome.

The probability distribution of the age, in years, of a person in a population is also discrete. There are about 120 possible outcomes, and if we adopted a frequentist perspective we could look up the age distribution of people in the United States, say, and work with those frequencies as the probabilities. The probability of a random person in the United States being 120 years old would be very small, while the probability of a randomly selected person in the United States being 30 would be about 1.5%. Notice the use of “the probability of a random person” in that sentence. One wants to be careful, and not say things like, “the probability of an old person being over 65 is equal to 1.”

Some random variables can take on so many different values that we refer to the distribution as a continuous probability distribution. The probability of any particular value is tiny. Suppose we allow for fractional values of age? What is the probability of someone in the United States being 32.756483 years old? Perhaps smaller than 1 in 350 million, or 0.0000000028! Therefore, for continuous variables, we talk about the probability of the variable taking a value within a certain range. For the United States in 2020, about 18% of the population was of age less than 18, about 65% between 18 and 64, and about 17% 65 or more years old. These frequencies then are the probabilities of a random person from the United States being in those age ranges.

A special probability distribution that is used repeatedly in econometrics is the standard normal distribution. The standard normal distribution is symmetric and “bell-shaped,” with mean 0 and standard deviation equal to 1. The formula for the probability density function of the standard normal distribution is:

\[f(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}\]

The density function shows the basic shape of the distribution and can be used to calculate the probability of observing values \(x\) within a certain range. The formula for the cumulative distribution function of the standard normal distribution is:

\[\Phi(x)= \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-t^2/2}\,dt\]

The cumulative distribution function tells us the probability that an outcome falls below a certain number (the \(x\) in the formula). Conversely, the probability of an outcome of a random variable that follows the standard normal distribution being larger than a certain number is just \(1 - \Phi(x)\).

These probabilities would be difficult to calculate by hand, and most people would use a computer to carry out the necessary calculations to find \(\Phi(x)\). In R, for example, we could calculate \(\Phi(-0.2)\) using the command “pnorm(-0.2)”.

Very often, in econometrics, we shall want to calculate the probability that a standard normal random variable would be farther from 0 than a certain number. Shortly we will relate this to the idea of testing a hypothesis. Suppose that the “certain number” is 2.25. Then we want to know the probability of a standard normal random variable being smaller than -2.25 or larger than 2.25. Using the cumulative distribution function above, we can see that this probability is given by \(\Phi(-2.25) + (1 - \Phi(2.25))\). This is equivalent to \(2 \cdot \Phi(-2.25)\), since the normal distribution is symmetric. Using R, we would enter the command 2*pnorm(-2.25) and see that this probability is 0.0244, or 2.44%.
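In R, these calculations each take one line:

pnorm(-0.2)        # Phi(-0.2), about 0.42
2 * pnorm(-2.25)   # two-tailed probability for a cutoff of 2.25, about 0.0244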


Figure 2.3: Illustration of probability calculation, t-stat=2.25

Figure 2.3 illustrates the calculation. The bell-shaped curve is the density function of the standard normal distribution. The probability of being below or above a certain number on the horizontal axis is given by the shaded area under the density (mathematically, the integral), and in this case the probability we want is the sum of the areas of the two shaded “tails.” Suppose, to take a second example, the certain number were 1.32. We want to know the probability of being smaller than -1.32 or larger than 1.32. This probability is given by \(2 \cdot \Phi(-1.32)\), since the normal distribution is symmetric. The value turns out to be 0.1868. Figure 2.4 illustrates the probability.


Figure 2.4: Illustration of probability calculation, t-stat=1.32

2.2.2 Expectations of random variables

Just as we use the summation sign to calculate the mean of the observations of a variable in a dataset, we can think about analogous measures of central tendency and variation for random variables. The expectations operator is a key concept for doing this. \(E[...]\) is the sum or integral of all possible values of a random variable, each multiplied by the probability of that value. The expected value of the random variable is often called its population mean. We can define the population mean for a discrete distribution as

\[E[X] = \mu_X = \sum_{x} x P(X=x),\]

summing over all possible values of \(X\). For a continuous distribution, we define the expected value of the random variable, the population mean, as

\[E[X] = \mu_X = \int_x x f(x)\,dx,\]

where \(f(x)\) is the probability density function. The Greek letter \(\mu\) is pronounced mew, like the sound a kitten makes. Expected values work like sums. If \(X\) and \(Y\) are random variables and \(a\), \(b\), and \(c\) are constants:

\[E[a + bX + cY] = a + bE[X] + cE[Y]\]

Note that generally \(E[XY] \neq E[X]E[Y]\). Let’s consider an example of a discrete random variable. Suppose we had a variable \(Y\) with three possible outcomes: 4, 8, or 16, and probabilities 1/4, 1/4, and 1/2, respectively. The expected value is \[E[Y]= 4 \cdot (1/4) + 8 \cdot (1/4) + 16 \cdot (1/2) = 11\]
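The same arithmetic can be done in R by treating the outcomes and their probabilities as vectors:

y <- c(4, 8, 16)        # possible outcomes of Y
p <- c(1/4, 1/4, 1/2)   # their probabilities
sum(y * p)              # E[Y] = 11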

2.2.3 Conditional distributions and expectations

The probabilities of outcomes may also depend on the level of another variable; e.g., the probabilities of different values of \(Y\) may depend on the level of an \(X\) variable. In such cases we refer to \(X\) and \(Y\) as having a “joint” probability distribution. Continuing our example, suppose that when \(X\) is equal to 1, the probabilities of the three outcomes of \(Y\) are the same as before: 1/4, 1/4, and 1/2. But suppose that when \(X\) is equal to 2, the probabilities of the three outcomes of \(Y\) are different: 1/2, 1/2, and 0—that is, the largest outcome cannot happen when \(X\) is equal to 2. Now the expected value of \(Y\) depends on \(X\). The expected value of \(Y\) conditional on \(X = 1\) is the expected value of \(Y\) given that \(X\) is 1. We use the notation of a vertical bar | to indicate “conditional on” or “given,” and the resulting expected value is called a conditional expectation:

\[E[Y|X = 1] = 4 \cdot (1/4) + 8 \cdot (1/4) + 16 \cdot (1/2) = 11\]

Similarly, when \(X = 2\),

\[E[Y|X = 2] = 4 \cdot (1/2) + 8 \cdot (1/2) + 16 \cdot 0 = 6\]

Finally, we could also calculate the expected value of \(Y\) across all possible values of \(X\). To obtain this, we would need to know the probability that \(X\) takes on the value 1 or 2. Let us assume that \(X\) takes on values 1 or 2 with equal probability one half (1/2). Then the overall \(E[Y]\) is given by:

\[E[Y] = E[Y|X = 1] \cdot P(X=1) + E[Y|X = 2] \cdot P(X=2) = 11 \cdot (1/2) + 6 \cdot (1/2) = 8.5\]
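A short R sketch of the same calculation, using the conditional probabilities from the example:

y <- c(4, 8, 16)
p_given_x1 <- c(1/4, 1/4, 1/2)    # P(Y = y | X = 1)
p_given_x2 <- c(1/2, 1/2, 0)      # P(Y = y | X = 2)
E_y_x1 <- sum(y * p_given_x1)     # E[Y | X = 1] = 11
E_y_x2 <- sum(y * p_given_x2)     # E[Y | X = 2] = 6
(1/2) * E_y_x1 + (1/2) * E_y_x2   # overall E[Y] = 8.5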

2.2.4 Population variance

As we saw above in the case of sample statistics, variance is a measure of dispersion or spread. The “population” variance of a random variable uses the probability distribution to calculate the variance, as follows:

\[var(X) = V(X) = E[(X-E[X])^2] = E[X^2] - (E[X])^2\]

And as before, the standard deviation is the square root of the variance:

\[s.d.(X) = \sigma_X = \sqrt{V(X)}\]
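Continuing the discrete example from above, the population variance and standard deviation can be computed directly from these definitions:

y <- c(4, 8, 16)
p <- c(1/4, 1/4, 1/2)
EY  <- sum(y * p)     # E[Y] = 11
EY2 <- sum(y^2 * p)   # E[Y^2] = 148
EY2 - EY^2            # population variance = 27
sqrt(EY2 - EY^2)      # population standard deviation, about 5.2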

2.3 Sampling from a population and inferring differences in population means

2.3.1 Distribution of a sample mean

A sample mean is itself a random variable. It varies from one sample to another. Under random sampling, a sample mean \(\overline{X}\) has the following mean, variance, and standard deviation:

\[E[\overline{X}] = E[X] = \mu_x\] \[V(\overline{X}) = \frac{\sigma_x^2}{n}\]
\[s.d.(\overline{X}) = SE(\overline{X}) = \frac{\sigma_x}{\sqrt{n}}\] where SE stands for standard error.

Under quite general conditions, the Central Limit Theorem tells us that a sample mean is approximately normally distributed if the sample size is reasonably large, and that the following quantity, usually known as a t-statistic, is distributed approximately standard normal (Z):

\[\frac{\overline{X}-\mu_x}{SE(\overline{X})} = \frac{\overline{X}-\mu_x}{\sigma_x/\sqrt{n}}\]
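A short simulation can illustrate these results. This is a sketch with made-up numbers (a normal population with mean 30 and standard deviation 10, and samples of size n = 50): we draw many samples, compute each sample mean, and compare the spread of those means to \(\sigma_x/\sqrt{n}\).

set.seed(123)
n <- 50
sigma_x <- 10
# 5,000 samples of size n from a population with mean 30 and sd 10
sample_means <- replicate(5000, mean(rnorm(n, mean = 30, sd = sigma_x)))
mean(sample_means)   # close to the population mean of 30
sd(sample_means)     # close to sigma_x/sqrt(n)
sigma_x / sqrt(n)    # about 1.41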

2.3.2 Inference about the difference in two sample means

Although we might wish to draw inferences about the overall population mean of a variable, it is more common in the social sciences to be interested in inferences about differences in the means of some variable, for observations drawn from two groups in the population. These kinds of comparisons are ubiquitous, whether we are comparing mean earnings of people of different genders or nationalities, or comparing the outcomes for treated vs. untreated groups in some policy experiment.

Let’s think about how we might go about inferring characteristics of the population from samples. Suppose we had drawn a sample of two men and two women from the population of the United States. The mean age of the two men was 32.6, while the mean age of the two women was 55.0. Should we infer that women are, on average, older than men by more than 20 years in the United States? Of course not. We recognize that our samples are very small. We should not infer anything about the two groups, men and women, in the population. But if the samples were 2 million men and 2 million women, both drawn at random, then surely we could confidently infer that in this population the two groups indeed have a large age gap. Inferential statistics provides tools for thinking systematically about the degree of confidence we may have in our inferences, and how that depends on such issues as sample size.

The way we proceed in this case is to think of the difference in means of the two samples (one for men and one for women) as itself a random variable. The difference in means is a random variable because if we had taken different samples from the population we would have calculated different sample means, and hence different differences in means. If we took yet another two samples, we would have yet another difference in means.

Our question then is, for the difference in means that we observe, calculated from the samples of men and women that we have, what can we infer about the population as a whole? Specifically, can we infer with some confidence that there is a non-zero difference between the population means of the two groups? Or should we infer that our evidence is consistent with no difference in the population means? And to the extent that we infer that there is a difference, how large is it? Within what plausible range of values?

One important thing to keep in mind about inferential statistics is that we must live with degrees of uncertainty. If we only have a random sample from which to draw inferences about the whole population, we simply do not have definitive, complete evidence about the population, and our inferences will always be subject to some uncertainty. One of our goals as econometricians is to characterize and be honest about the degree of uncertainty in our inferences.

Let’s now address our first question: Can we infer that there is a nonzero difference in the population means? To answer this, we start by positing a null hypothesis that there is no difference in the population mean ages of men and women, and ask whether the observed difference in the sample is large enough that we should abandon (or “reject”) this null hypothesis.

If the null hypothesis of no population difference were actually true, we would still expect to observe some difference in sample means, just due to random variation across samples. Using some assumptions and evidence from the sample, we can calculate the probability that we would observe a difference in sample means as large as the difference we observe, were the population difference actually zero, as hypothesized. Then we will say that we should reject the null hypothesis if this probability is very small. That is, we reject the null of no difference if it is very unlikely that it would yield the sample we observe if it were true. Note that “very unlikely” does not mean impossible! We might have an unusual (not very representative) random sample and, mistakenly, reject a true null hypothesis of no difference. That’s an aspect of living with uncertainty.

To operationalize the hypothesis test, let diff represent the random variable of the difference in sample means across many samples, under the assumption that the null hypothesis is true (zero difference in the population means). What is the probability that diff would be as far from zero as the difference we observed (in either direction)? We can calculate this as:

\[Prob[\text{diff} < - |\text{Observed difference}|] + Prob[\text{diff} > |\text{Observed difference}|]\]

where \(|x|\) stands for the absolute value of a number \(x\), and Observed difference represents the difference in sample means we obtain from our actual sample. This expression is equivalent to:

\[2 * Prob[\text{diff} < - |\text{Observed difference}|]\] if the probability distribution is symmetric. Which it is! In fact, by the Central Limit Theorem, for reasonably large samples, we can assume that the sample mean for each group follows a normal distribution, and so the difference in the means also follows a normal distribution.

So all we need to do is standardize the difference in means so that it follows the standard normal distribution, and calculate the probability area just as in Figures 2.3 and 2.4.

We calculate the standardized variable, that we shall generically call a t-statistic, as the difference in means minus the hypothesized population difference in means (which we assume to be equal to zero under the null), divided by the standard error of the difference in means (remember, we think of the difference in means as a random variable, and the standard deviation of an estimator is called the “standard error” by convention). So we have,

\[\text{t-statistic} = \frac{\text{Observed difference} - 0}{SE(\text{diff})} = \frac{\overline{X}_\text{men} - \overline{X}_\text{women}}{SE(\text{diff})}\] We know from basic statistics that \(SE(\text{diff})\) can be approximated using the sample data. We calculate it as:

\[SE(\text{diff}) = \sqrt{ \frac{var(X_\text{men})}{n_\text{men}} + \frac{var(X_\text{women})}{n_\text{women}} }\] This t-statistic approximately follows the standard normal distribution. If it is greater than 1.96 in absolute value, by convention we say that we should reject the null hypothesis that there is no difference in the population means. The reason we reject that null hypothesis is that, if it were true, there would be only a 5% probability (the area calculated would be only 0.05) of observing a t-statistic farther from zero than 1.96. That probability is small, so we say, “Wow, we had a very large difference in the two sample means, relative to the standard error of the difference in sample means, so it is pretty likely that the actual difference in the population is not zero.”

Turning the calculation around, given a t-statistic and using the standard normal distribution, we can calculate the total probability in the tails of the distribution, which can be interpreted as the probability of obtaining a difference in means at least that far from zero, if the null hypothesis were true. This probability is called the p-value for the test. By convention, a p-value less than 0.05 is often used as the criterion for rejecting the null hypothesis of no difference, at the 5% significance level.
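To see the moving parts, here is a sketch that computes the t-statistic and two-sided p-value for the Kenya earnings comparison by hand, assuming the kenya dataframe from earlier in the chapter is loaded (the object names are ours, and the standard normal approximation is used rather than the t-distribution):

kd <- subset(kenya, earnings_usd <= 1000 & !is.na(earnings_usd))
men   <- kd$earnings_usd[kd$female == 0]
women <- kd$earnings_usd[kd$female == 1]
diff_means <- mean(men) - mean(women)                             # observed difference
se_diff <- sqrt(var(men)/length(men) + var(women)/length(women))  # SE of the difference
t_stat  <- diff_means / se_diff
p_value <- 2 * pnorm(-abs(t_stat))                                # two-sided p-value
c(diff = diff_means, SE = se_diff, t = t_stat, p = p_value)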

You may remember from your earlier statistics course that this kind of hypothesis test is called a two-sided or two-tailed test, because we calculate the probability of being that far away from the null hypothesis in either direction. Two-sided tests are the convention in most econometrics applications.

2.3.3 Confidence intervals

A hypothesis test is framed as a yes-no test of a particular hypothesis: can we reject a single hypothesized value at a specific significance level, or not? Another way to apply inferential statistics, drawing on the same tools, is “interval estimation.” The intuition here is to use our sample to estimate a confidence interval, a plausible range of values, for our estimate of interest. Confidence intervals are quite commonly reported in journalistic accounts of polling results, where they are referred to as “margins of error.” Although the technical meaning of a confidence interval is somewhat subtle, the calculation is straightforward and provides useful information. Consider the example of a calculation of the difference in mean age between men and women in a sample. Suppose we obtain an estimated difference \(\overline{age}_\text{women} - \overline{age}_\text{men} = 1.52\), with an estimated standard error of the difference of SE = 0.61. If the sample is reasonably large (more than 100 observations is usually quite enough), we can write the 95% confidence interval for the difference in ages as

\[CI = \text{diff} \pm 1.96 \cdot SE = 1.52 \pm 1.96 \cdot 0.61 = 1.52 \pm 1.20 = [0.32, 2.72]\]

That is, we have “95% confidence” that the population value of the difference lies within that interval. (To be more precise, in 95% of random samples, a confidence interval calculated this way would contain the true population value.)
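The arithmetic for this example can be checked in R, plugging in the difference of 1.52 and the standard error of 0.61 from the text:

diff <- 1.52
se   <- 0.61
diff + c(-1, 1) * qnorm(0.975) * se   # 95% CI, approximately [0.32, 2.72]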

2.4 Testing hypotheses using R

R can calculate t-statistics and p-values in a snap. Here is the command, for the Kenya data. Note that the mean is being calculated for the variable earnings_usd, while the comparison is being made between groups defined by dummy variable female, which follows the tilde symbol (∼). Hence we are testing the difference in mean earnings between women and men.

t.test(earnings_usd ~ female, data = subset(kenya, earnings_usd <= 1000))

    Welch Two Sample t-test

data:  earnings_usd by female
t = 29, df = 18035, p-value <2e-16
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 79.7 91.4
sample estimates:
mean in group 0 mean in group 1 
            269             183 

The output in the console will look something like what is shown above.

The results are pretty clear: women have much lower earnings, on average. The t-statistic is 29, with an associated p-value less than 2e-16, which is tiny. The notation 2e-16 means \(2 \times 10^{-16}\): move the decimal point 16 places to the left, so the number is 0.0000000000000002. This is a very small probability. The men have average monthly earnings of 269 USD and the women have average monthly earnings of 183 USD.

We can use the datasummary_balance command in the package modelsummary to make a table of differences in means for more than one variable, illustrating that in open-source software like R there are often many ways to obtain the same results. The datasummary_balance command has the added virtue that if the estimatr package is installed, the table will include tests of the null hypothesis of no difference in means of the various variables for the two groups defined by the variable female.

datasummary_balance(~female,
   data = kenya %>%
      filter(earnings_usd <= 1000) %>%
      select(earnings_usd, age, educ_yrs, female),
   stars = TRUE,
   title = "Table 1: A good title should go here",
   notes = "Note: Superscripts denote p-value according to +=.1, *=.05, **=.01, ***=0.001")
Table 2.1: Table 1: A good title should go here

                        0                     1
                 Mean   Std. Dev.      Mean   Std. Dev.    Diff. in Means   Std. Error
earnings_usd     268.7  223.5          183.1  197.2        −85.6***         3.0
age              32.9   10.0           32.7   8.4          −0.2             0.1
educ_yrs         9.5    3.8            9.6    4.0          0.1**            0.1

Note: Superscripts denote p-value according to +=.1, *=.05, **=.01, ***=0.001

We have left placeholders in the code for the title and any note you might want to appear at the bottom of the table. Notice the syntax after the “data =” option. We use the pipe operator to take the dataset into the filter() command (to restrict or subset the observations included) and then into the select() command, where we list the variables we want to appear in the table: kenya %>% filter(earnings_usd <= 1000) %>% select(earnings_usd, age, educ_yrs, female).

A table of descriptive statistics for the Kenya earnings dataset, by gender, will be displayed in the Viewer in RStudio, and is reproduced in Table 2.1. The mean and standard deviation for each variable, for each gender, and the difference in means by gender and the standard error of the difference in means are presented.

For the Kenya earnings data, we can obtain a confidence interval for the difference in earnings with the following code:

t.test(earnings_usd ~ female, data = subset(kenya, earnings_usd <= 1000),
   var.equal = TRUE)$conf.int
[1] 79.8 91.3
attr(,"conf.level")
[1] 0.95

This produces the output shown above in the console.

In ordinary text we would write this as CI = [79.8, 91.3], with a comma between the lower and upper bounds of the interval. Notice that here the difference in means is positive, while in the datasummary_balance output it was negative. The two commands have different defaults for whether they subtract the group coded 0 from the group coded 1, or vice versa; this is something to pay attention to.

2.5 Recapitulation

To close, we asked ChatGPT to explain, in the style of Cormac McCarthy, how to calculate a confidence interval for the difference in two sample means drawn for different groups from a population. Here is the text generated:

In the desolate landscape of statistical inquiry, where numbers cast their long shadows upon the barren fields of data, one may find themselves compelled to reckon with the elusive notion of certainty. To navigate these treacherous expanses and uncover the hidden truths that lie beneath, one must delve into the heart of comparative analysis. Consider two separate groups, drawn from the same despoiled population, each a testament to its own existence in this unforgiving realm. From these groups, sample means emerge like echoes from distant canyons, whispering tales of what might be. But how does one discern if these echoes bear the weight of reality or merely the fleeting whispers of chance?

Herein lies the quest: to calculate a confidence interval for the difference in these two sample means. This endeavor demands a cautious hand and a steady resolve, for it hinges upon the delicate balance between what is observed and what is inferred. First, the pilgrim of statistics must compute the sample means, those austere guardians of central tendency. Let them be denoted as \(\overline{x}_1\) and \(\overline{x}_2\), representatives of their respective groups, beckoning us to unravel their mysteries. Next, the standard deviations of these intrepid samples must be heeded, symbols of variability in this arid landscape. Let their measured vigilance be known as \(s_1\) and \(s_2\), indices of dispersion to guide our way.

With these companions laid bare, the pilgrim then determines the sample sizes, those vital counts of souls tested within each group. Let them stand as \(n_1\) and \(n_2\), silent witnesses to the rigor of our inquiry. Now, the pilgrim calculates the standard error of the difference in sample means, a beacon of hope amidst the vast expanse of uncertainty:

\[SE(\overline{x}_1-\overline{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]

This formidable quantity, born of strife and steadfast calculation, serves as the foundation upon which the confidence interval shall be erected. To forge the confidence interval itself, the pilgrim employs the sacred rites of statistical lore, invoking the mantle of distribution theory and the solemn rites of approximation:

\[(\overline{x}_1 - \overline{x}_2) \pm t^* \cdot SE(\overline{x}_1 - \overline{x}_2)\] Here, t* is the critical value from the t-distribution with degrees of freedom approximated through judicious application of the pooled variance, a ritualistic melding of group identity. Thus, within these bounds lies the revelation sought: a confidence interval, a sanctified zone wherein the difference in means may find refuge from the tempestuous winds of chance. Amidst the ruins of uncertainty, this interval stands as a testament to the pilgrim’s diligence and the enduring quest for understanding amidst the statistical wilderness. In this manner, through the haunting prose of numbers and the austere poetry of calculation, one may confront the challenge of inferring the difference in sample means with the stoic resolve worthy of this bleak statistical landscape.

Bleak, are they, those differences in sample means.

Review terms and concepts:

• box plot
• inference
• random variable
• probability distribution
• population mean
• standard normal distribution
• expected value
• conditional expectation
• hypothesis test
• null hypothesis
• t-statistic
• p-value
• statistical significance
• confidence interval