10 Sampling distributions

10.1 Statistics and parameters

A parameter is a measure characterizing one or more populations or processes. Sometimes, to emphasize that the parameter pertains to the population and not the sample, it is explicitly called a population parameter.

Examples of parameters:

the proportion of blue-eyed people in the world population,
the difference between the average incomes of men and women in Poland,
the probability that the next birth in a given hospital will be a girl.

We will assume that the parameter is a constant (unchanging) value. Usually, we cannot know it directly, so we want to estimate its value. To estimate parameters from the population, we need methods of statistical inference.

A statistic (or: sample statistic) is a number summarizing a (random) sample taken from the population. We will use sample statistics to estimate the unknown value of parameters.

Typical pairs of parameters and statistics along with their notations:

Population mean $\mu$ – sample mean $\bar{x}$.
Population proportion $p$ – sample proportion $\hat{p}$.
Population variance $\sigma^2$ – sample variance $s^2$.
Population standard deviation $\sigma$ – sample standard deviation $s$.
Difference between means in two populations $\mu_1-\mu_2$ – difference between means in the sample $\bar{x}_1-\bar{x}_2$.
Difference between proportions in two populations $p_1-p_2$ – difference between proportions in the sample $\hat{p}_1-\hat{p}_2$.
Correlation in a given population $\rho$ – correlation in the sample drawn from that population $r$.

It is worth remembering that a sample statistic can be treated as a random variable. If we repeatedly take a random sample from a given population, the statistic (e.g., the mean) will take different values. Therefore, it is a random variable. A sample statistic, like any other variable, has its distribution, called the sampling distribution.
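The idea of a statistic as a random variable can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the population (10,000 values generated from a normal distribution with mean 170 and standard deviation 10), the sample size, and the helper name `sample_means` are all hypothetical and do not come from the text.

```python
import random
import statistics

def sample_means(population, n, repetitions, seed=0):
    """Draw `repetitions` random samples of size n (with replacement)
    and return the mean of each sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(repetitions)]

# Hypothetical population (an assumption made only for this illustration).
pop_rng = random.Random(42)
population = [pop_rng.gauss(170, 10) for _ in range(10_000)]

# The parameter is a single fixed number ...
print("population mean:", round(statistics.fmean(population), 2))

# ... while the statistic changes from sample to sample.
means = sample_means(population, n=25, repetitions=1_000)
print("first five sample means:", [round(m, 2) for m in means[:5]])

# The collection of these values approximates the sampling distribution.
print("mean of the sample means:", round(statistics.fmean(means), 2))
print("sd of the sample means:", round(statistics.stdev(means), 2))
```

Each run of the sampling step gives a different sample mean, while the population mean stays fixed. A histogram of `means` would approximate the sampling distribution of the sample mean for $n=25$.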

If we want to indicate that we treat a statistic, e.g., the sample mean, as a random variable, we can write this statistic with a capital letter, e.g., $\bar{X}$.

10.2 Sampling distribution of the sample mean

If a variable in the population has a distribution with a mean $\mu$ and a standard deviation $\sigma$, then the mean $\bar{X}$ of an $n$-element sample has a distribution with an expected value $\mathbb{E}(\bar{X})=\mu$ and a standard deviation $\sigma/\sqrt{n}$. One can reach this conclusion using the relationships described in section 7.5.3.
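A short derivation sketch of both results follows. It assumes that the sample consists of independent observations $X_1, \dots, X_n$, each with mean $\mu$ and variance $\sigma^2$, and it uses the notation $\operatorname{Var}$ for variance (an assumption about notation, which may differ from the one used in section 7.5.3):

$$
\begin{aligned}
\mathbb{E}(\bar{X}) &= \mathbb{E}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}(X_i) = \frac{1}{n}\cdot n\mu = \mu, \\
\operatorname{Var}(\bar{X}) &= \operatorname{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\operatorname{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \\
\sqrt{\operatorname{Var}(\bar{X})} &= \frac{\sigma}{\sqrt{n}}.
\end{aligned}
$$

The first line uses only the linearity of the expected value. The second line additionally relies on independence, which lets the variance of the sum be written as the sum of the variances.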

The expected value of the sampling distribution of the sample mean is sometimes denoted by the symbol $\mu_{\bar{X}}$ and the standard deviation of the sample mean by $\sigma_{\bar{X}}$. The standard deviation of the sample mean, or its estimate, is sometimes called the standard error of the sample mean.

If the variable in the population has a normal distribution, then the sample mean will also have a normal distribution.
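When the population is normal, probabilities for the sample mean can be computed directly from the normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$. The following sketch uses hypothetical values ($\mu=50$, $\sigma=10$, $n=25$) chosen only for illustration, and Python's standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical population parameters and sample size (assumptions).
mu, sigma, n = 50.0, 10.0, 25

# Standard deviation of the sample mean (standard error).
standard_error = sigma / sqrt(n)

# Sampling distribution of the sample mean for a normal population.
sampling_dist = NormalDist(mu, standard_error)

# P(X-bar < 53) and P(48 < X-bar < 52)
p_below_53 = sampling_dist.cdf(53)
p_between = sampling_dist.cdf(52) - sampling_dist.cdf(48)

print("standard error:", standard_error)
print("P(mean < 53):", round(p_below_53, 4))
print("P(48 < mean < 52):", round(p_between, 4))
```

Here the standard error equals $10/\sqrt{25}=2$, so 53 lies 1.5 standard errors above $\mu$, and the interval from 48 to 52 covers one standard error on each side of $\mu$.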

10.3 Central limit theorem (CLT)

As $n$ increases, the sampling distribution of the sample mean approaches the normal distribution for a variable with any distribution that has finite variance.

This is ensured by the central limit theorem.
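The effect of the theorem can be observed in a simulation. The sketch below is an illustration under assumptions not taken from the text: the population is exponential with rate 1 (mean 1, standard deviation 1, strongly right-skewed), and the helper names `skewness` and `simulate_means` are hypothetical. Skewness close to 0 is used here as a rough indicator of a symmetric, normal-like shape.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean of the cubed standardized values."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def simulate_means(n, repetitions=5_000, seed=1):
    """Means of `repetitions` samples of size n from an exponential
    population with rate 1 (a skewed, non-normal distribution)."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(repetitions)]

for n in (1, 5, 30, 100):
    means = simulate_means(n)
    print(f"n={n:>3}  mean={statistics.fmean(means):.3f}  "
          f"sd={statistics.stdev(means):.3f}  "
          f"skewness={skewness(means):.3f}")
```

As $n$ grows, the mean of the simulated sample means stays near 1, their standard deviation shrinks roughly like $1/\sqrt{n}$, and the skewness moves toward 0, which is consistent with the distribution becoming closer to normal.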

10.4 Point estimators – biased and unbiased

A point estimator is a sample statistic used to estimate a parameter of the population. Such estimation provides a single number (a point). The value obtained from a point estimator is often called an estimate or point estimate, and the process itself is referred to as point estimation.

For example, the sample mean ($\bar{X}$) is a point estimator of the population mean $\mu_X$, and the sample variance ($S^2$) is a point estimator of the population variance ($\sigma^2$).

An estimator $\hat{\Theta}$ is an unbiased estimator for the population parameter $\theta$ if $\mathbb{E}(\hat{\Theta})=\theta$. Otherwise, the estimator is biased⁴.

Examples (see also exercise 10.3):

The sample mean $\bar{X}$ is an unbiased estimator of the population mean because $\mathbb{E}(\bar{X})=\mu$.
The sample variance $S^2$ (computed using the formula (A.72), which has $n-1$ in the denominator, meaning the sample size reduced by one) is an unbiased estimator of the population variance $\sigma^2$: $\mathbb{E}(S^2)=\sigma^2$.
$\mathbb{E}(S) \ne \sigma$, so the sample standard deviation is not an unbiased estimator of the population standard deviation.
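These three statements can be checked exactly on a small example. The sketch below assumes a hypothetical population, a fair six-sided die (not a population used in the text), and enumerates all 36 equally likely ordered samples of size $n=2$ drawn with replacement. The helper `sample_variance` and its `ddof` argument are illustrative names; `ddof=1` gives the denominator $n-1$ and `ddof=0` gives the denominator $n$.

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Hypothetical population: outcomes of a fair die, each with probability 1/6.
values = [1, 2, 3, 4, 5, 6]
mu = Fraction(sum(values), len(values))
sigma2 = sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by n - ddof."""
    n = len(sample)
    m = Fraction(sum(sample), n)
    return sum((Fraction(x) - m) ** 2 for x in sample) / (n - ddof)

# All ordered two-element samples; each has probability 1/36.
samples = list(product(values, repeat=2))
prob = Fraction(1, len(samples))

e_mean = sum(Fraction(sum(s), 2) * prob for s in samples)
e_s2_unbiased = sum(sample_variance(s, 1) * prob for s in samples)
e_s2_biased = sum(sample_variance(s, 0) * prob for s in samples)
e_s = sum(sqrt(sample_variance(s, 1)) * float(prob) for s in samples)

print("mu =", mu, " E(mean) =", e_mean)
print("sigma^2 =", sigma2, " E(S^2) with n-1 =", e_s2_unbiased)
print("E(S^2) with n =", e_s2_biased)
print("sigma =", round(sqrt(sigma2), 4), " E(S) =", round(e_s, 4))
```

In this example $\mathbb{E}(\bar{X})$ equals $\mu = 7/2$ and $\mathbb{E}(S^2)$ with the $n-1$ denominator equals $\sigma^2 = 35/12$, while dividing by $n$ gives $35/24$, half of the true variance. The expected value of $S$ comes out smaller than $\sigma$, which illustrates the bias of the sample standard deviation.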

10.5 Statistical inference

Sampling distributions provide the mathematical foundation for statistical inference methods. The two main methods of statistical inference are:

estimation using confidence intervals,
hypothesis testing (statistical tests).

A confidence interval is an interval that serves as an estimate of an unknown parameter with confidence level $(1-\alpha)$. If the appropriate conditions are met, the probability that the confidence interval we construct will contain the true parameter is $(1-\alpha)$.
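The meaning of the confidence level can be illustrated by simulation. The sketch below rests on several assumptions not stated in the text: a normal population with hypothetical parameters, a known $\sigma$, and the interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$, which is one common construction and not necessarily the one presented in later chapters. The function name `coverage` is also hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, alpha, repetitions=2_000, seed=7):
    """Fraction of simulated intervals x_bar +/- z * sigma / sqrt(n)
    that contain the true mean mu (sigma is assumed to be known)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / repetitions

# Hypothetical population parameters, used only for this illustration.
print("alpha = 0.05:", coverage(mu=100, sigma=15, n=30, alpha=0.05))
print("alpha = 0.10:", coverage(mu=100, sigma=15, n=30, alpha=0.10))
```

The reported fractions should be close to $1-\alpha$, that is, about 0.95 and 0.90. The parameter is fixed in every repetition. What varies is the interval, because it is built from a random sample.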

A statistical test is a technique in which we propose two hypotheses about population parameters: the null hypothesis and the alternative hypothesis, and then, based on the data, determine whether we can reject the null hypothesis in favor of the alternative.

In the following chapters, methods for constructing confidence intervals and statistical tests for the simplest parameters will be presented.

10.6 Links

Sampling distribution of the sample mean / Central Limit Theorem

Visualization 1: https://seeing-theory.brown.edu/probability-distributions/index.html#section3
Visualization 2 – continuous variable: https://istats.shinyapps.io/sampdist_cont/
Visualization 3 – discrete variable: https://istats.shinyapps.io/SampDist_discrete/
Visualization 4 – proportion: https://istats.shinyapps.io/SampDist_Prop/
Visualization 5 – proportion: https://learning.statistics-is-awesome.org/threethings/

Central Limit Theorem – explanation on the 3Blue1Brown channel

10.7 Questions

Question 10.1 Check if you know the answers to the following review questions:

What are the similarities and differences between population parameters and sample statistics?
What is a sampling distribution?
What is a point estimator of a population parameter? Provide an example.
What is the difference between a biased and an unbiased point estimator?
What do the symbols $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$ represent?
What is the relationship between the expected value of the sample mean $\bar{X}$ and the population mean from which the sample is drawn?
What is the relationship between the standard deviation of the sampling distribution of the sample mean $\bar{X}$ and the standard deviation of the characteristic in the population from which the sample is drawn?
What does the Central Limit Theorem state?
Will the sample mean always have an (approximately) normal distribution?

10.8 Exercises

Exercise 10.1 Assume that the body weight distribution of people in a building is approximately normal with a mean of 71 kg and a standard deviation of 15 kg.

What is the probability that 7 people who randomly enter the elevator will have a total weight exceeding 525 kg?

What is the probability that the average body weight of seven randomly selected people entering the elevator will be less than 63 kg?

Exercise 10.2 (McClave and Sincich 2012) The following probability distribution describes a population in which the possible measurement values are 0, 2, 4, and 6. Each of these values occurs with equal probability.

$x$	$\textbf{p}(x)$
0	0.25
2	0.25
4	0.25
6	0.25

List all possible distinct two-element samples that can be drawn from this population (order of values matters).
Compute the mean for each of the samples listed in the previous step.
A two-element sample ($n=2$) is randomly drawn from this population. What is the probability of obtaining a specific sample (for example, the first one listed in part a)?
A two-element sample is drawn from this population. List all possible values of $\bar{X}$ obtained in part (b) and determine the probability of each value. Present the distribution of the variable $\bar{X}$ in table form. Visualize the sampling distribution of the sample mean in a plot.

Exercise 10.3 (McClave and Sincich 2012) The following probability distribution ("population") is given:

$x$	$\textbf{p}(x)$
0	1/3
1	1/3
4	1/3

Find $\mu$ and $\sigma^2$.
Determine the sampling distribution of the sample mean $\bar{X}$ for a randomly selected sample of size $n=2$ (two-element sample) drawn from this distribution.
Show that $\bar{X}$ is an unbiased estimator of $\mu$. [Hint: show that $\mathbb{E}(\bar{X})=\sum \bar{x}\textbf{p}(\bar{x})=\mu$.]
Determine the sampling distribution of the sample variance $S^2$ for a two-element random sample from this distribution.
Show that $S^2$ is an unbiased estimator of $\sigma^2$.

Exercise 10.4 (McClave and Sincich 2012) Assume that an $n$-element sample is randomly selected from a population with a mean $\mu=100$ and variance $\sigma^2=25$.

For each of the following values of $n$, provide the mean and standard deviation of the sampling distribution of the sample mean $\bar{X}$.

$n = 4$
$n = 25$
$n = 100$
$n = 50$
$n = 500$
$n = 1000$

Exercise 10.5 A sample of $n = 64$ observations is randomly drawn from a population with a mean of 20 and a standard deviation of 16.

Provide the mean (expected value) and standard deviation of the sampling distribution of the sample mean $\bar{X}$.
Describe the shape of the sampling distribution of $\bar{X}$. Does the answer depend on the sample size?
Calculate the $z$-score corresponding to $\bar{x} = 16$.
Calculate the $z$-score corresponding to $\bar{x} = 23$.
Determine $\mathbb{P}(\bar{X}<16)$.
Determine $\mathbb{P}(\bar{X}<23)$.
Determine $\mathbb{P}(16<\bar{X}<23)$.

Exercise 10.6 The average number of children per woman in a certain country is 1.31. Assume that the population standard deviation is 0.4. If we randomly select 200 women, what is the probability that the average number of children in this sample will be between 1.26 and 1.36?

Exercise 10.7 A house in a suburban area near Warsaw costs an average of 2.6 million PLN. Assume that the prices follow a normal distribution with a standard deviation of 500,000 PLN. A sample of 25 houses is randomly selected, and the sample mean is calculated. What is the probability that the sample mean exceeds 3 million PLN?

Exercise 10.8 An economist wants to estimate the average household income in a certain population. The population standard deviation is known to be 4,500 GBP. The economist uses a random sample of $n=225$ observations. What is the probability that the sample mean will differ from the population mean by less than 800 GBP?

Exercise 10.9 ABC mountain bikes are displayed in showrooms in Milan at an average price of $700. Assume that the standard deviation of bike prices is $100. If we randomly select 60 bikes, what is the probability that the average price of an ABC mountain bike in this sample will be between $680 and $720?

Exercise 10.10 In a certain city, 50% of families with children have one child, 30% have two children, 15% have three children, 4% have four children, and 1% have five children. A family festival invites 100 randomly selected families with at least one child; each family will bring all their children. The organizers are preparing small gifts for each child attending the festival. They want to be at least 99.9% sure that they have enough gifts for all children.

How many gifts should they prepare?
Assume that the organizers have prepared the number of gifts determined in part (a). What is the probability that at least 15 gifts will remain unused after distributing them to the children?

Literature

McClave, J. T., and T. T. Sincich. 2012. Statistics. Pearson Education. https://books.google.pl/books?id=gcYsAAAAQBAJ.

⁴ The Greek letter $\theta$ (theta) is often used as a general symbol for any parameter. Thus, in this paragraph, $\theta$ can represent, for example, $\mu$, $\sigma^2$, $\rho$, etc., while $\hat{\Theta}$ can correspond to $\bar{X}$, $S^2$, $R$, etc., meaning the respective sample statistic.