13 Sample size
13.1 Determination of sample size when estimating the mean
When planning a study, we might wonder how large the random sample should be to achieve sufficient accuracy.
For example, suppose we know that we will construct a confidence interval for the mean at a confidence level of \(1-\alpha=0.95\) and we want the margin of error to be no more than ±\(e\) (\(e\) stands for the margin of error, which in this context is also referred to as the maximum estimation error). In other words, we want the confidence interval to have a width no greater than \(2e\).
By reversing the formula for the confidence interval for the mean (11.1), we obtain the formula for the required sample size in this situation:
\[\begin{equation} n=\left(\frac{z_{\alpha/2}\cdot\sigma}{{e}}\right)^2 \tag{13.1} \end{equation}\]
As can be seen, an assumption about the population standard deviation \(\sigma\) is required. An approximation of this value can be obtained using previous studies on similar populations or based on intuition. To be on the safe side, the assumed value of \(\sigma\) can be slightly overestimated.
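The rearrangement behind (13.1) can be written out step by step. This is a sketch that assumes formula (11.1) has the usual form \(\bar{x} \pm z_{\alpha/2}\cdot\sigma/\sqrt{n}\), so that the margin of error \(e\) is the half-width of the interval:

\[ e = z_{\alpha/2}\cdot\frac{\sigma}{\sqrt{n}} \quad\Longrightarrow\quad \sqrt{n} = \frac{z_{\alpha/2}\cdot\sigma}{e} \quad\Longrightarrow\quad n=\left(\frac{z_{\alpha/2}\cdot\sigma}{e}\right)^2 \]

Because \(e\) enters the result squared in the denominator, halving the margin of error quadruples the required sample size.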
A useful heuristic for estimating the order of magnitude of \(\sigma\) is the rule that in many datasets, especially those where the distribution is approximately normal, almost all values fall within three standard deviations of the mean. This means that the range from the minimum to the maximum is about \(6\sigma\). Therefore, if we can assume that almost all values lie between \(A\) and \(B\), we can divide the difference \(B - A\) by six to obtain a rough estimate of the standard deviation.
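The range heuristic can be combined with formula (13.1) in a few lines. The following is a minimal sketch, not part of the chapter's templates: the function name `sample_size_mean` and the bounds `A = 20`, `B = 80` are hypothetical, and the standard-library `statistics.NormalDist` is used in place of `scipy.stats.norm` to obtain the normal quantile.

```python
import math
from statistics import NormalDist


def sample_size_mean(sigma, e, conf=0.95):
    """Required sample size from formula (13.1), rounded up."""
    alpha = 1 - conf
    # Upper alpha/2 quantile of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / e) ** 2)


# Hypothetical assumption: almost all values lie between A and B
A, B = 20, 80
sigma_guess = (B - A) / 6  # range is roughly 6 standard deviations

n = sample_size_mean(sigma_guess, e=2, conf=0.95)
print(sigma_guess, n)  # 10.0 97
```

Since the heuristic only gives the order of magnitude of \(\sigma\), the resulting \(n\) should be treated as a planning figure rather than an exact requirement.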
13.2 Determination of sample size when estimating a proportion
Similarly, we obtain the formula for the required sample size to estimate a proportion with a given accuracy of ±\(e\) (in this case, the maximum estimation error is often expressed in percentage points; note that when substituting into the formula, it should be converted into a fraction). By rearranging the formula (12.1), we obtain:
\[\begin{equation} n=\frac{({z}_{\alpha/2})^2 \cdot p \cdot q}{{e}^2} \tag{13.2} \end{equation}\]
Here \(q = 1 - p\). In this case, an assumption about the population proportion \(p\) is also required. For fixed \(z_{\alpha/2}\) and \(e\), the right-hand side of equation (13.2) reaches its maximum at \(p = 0.5\), because the product \(p \cdot q\) is largest there. So if we have no information that would justify a different \(p\), the safest approach is to take \(p = 0.5\).
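To see how the assumed proportion affects the result, formula (13.2) can be evaluated for several values of \(p\). This is an illustrative sketch: the function name `sample_size_proportion`, the grid of \(p\) values, and the settings \(e = 0.03\) and 95% confidence are assumptions chosen for the example, and `statistics.NormalDist` stands in for `scipy.stats.norm`.

```python
import math
from statistics import NormalDist


def sample_size_proportion(e, conf, p=0.5):
    """Required sample size from formula (13.2), rounded up."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z**2 * p * (1 - p) / e**2)


# Required n for several assumed proportions (e = 3 percentage points)
sizes = {p: sample_size_proportion(e=0.03, conf=0.95, p=p)
         for p in (0.1, 0.3, 0.5, 0.7, 0.9)}

for p, n in sizes.items():
    print(p, n)

# The largest requirement occurs at p = 0.5
print(max(sizes, key=sizes.get))
```

The required size falls symmetrically as \(p\) moves away from 0.5 in either direction, which is why \(p = 0.5\) is the conservative choice.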
13.3 Rounding
The formulas (13.1) and (13.2) will most likely return a fractional value, but the sample size \(n\) must, of course, be a natural number. Therefore, we should take the smallest natural number greater than or equal to the computed result. In other words, we always round up, which means using the ceiling function.
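The difference between rounding to the nearest integer and rounding up can be checked directly. In this small sketch the value `raw_n = 96.04` is a hypothetical fractional result, not taken from the exercises.

```python
import math

raw_n = 96.04  # hypothetical fractional result of formula (13.1) or (13.2)

n_nearest = round(raw_n)      # rounds down here, so the margin e is not guaranteed
n_ceiling = math.ceil(raw_n)  # smallest integer that satisfies the requirement

print(n_nearest, n_ceiling)  # 96 97
```

Note that `math.ceil` returns an `int`, whereas `np.ceil` returns a floating-point number, which is why the Python templates print values such as `16588.0`.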
13.4 Templates
R code
# Proportion estimation
# Data:
# Confidence level:
conf <- 0.99
# Margin of error (e):
e <- 0.01
# Assumed proportion (p):
p <- 0.5
# Computations
alpha <- 1 - conf
z <- -qnorm(alpha/2)
ceiling(z^2 * p * (1-p) / e^2)
## [1] 16588
# Mean estimation
# Data:
# Confidence level:
conf <- 0.9
# Margin of error (e):
e <- 1
# Assumed standard deviation (sigma):
sigma <- 10
# Computations
alpha <- 1 - conf
z <- -qnorm(alpha/2)
ceiling(z^2 * sigma^2 / e^2)
## [1] 271
Python code
# Proportion estimation
import numpy as np
from scipy.stats import norm
# Data:
# Confidence level:
conf = 0.99
# Margin of error (e):
e = 0.01
# Assumed proportion (p):
p = 0.5
# Computations
alpha = 1 - conf
z = -norm.ppf(alpha/2)
print(np.ceil(z**2 * p * (1 - p) / e**2))
## 16588.0
# Mean estimation
import numpy as np
from scipy.stats import norm
# Data:
# Confidence level:
conf = 0.9
# Margin of error (e):
e = 1
# Assumed standard deviation (sigma):
sigma = 10
# Computations
alpha = 1 - conf
z = -norm.ppf(alpha / 2)
print(np.ceil(z**2 * sigma**2 / e**2))
## 271.0
13.5 Exercises
Exercise 13.1 A company analyzing salaries wants to estimate the average salary of top-level managers with an accuracy of ±2000 USD, with 95% confidence. Based on previous analyses, it can be assumed that the variance of managers' salaries is about 40,000,000 USD². What is the minimum required sample size?
Exercise 13.2 How large should a random sample be to determine the proportion of defective components produced in a certain manufacturing process if we want to know this proportion with an accuracy of ±0.05 with 90% confidence? We have no information about the proportion of defective components in the population.
Exercise 13.3 A company believes that its market share is 14% (14% of consumers use the company's product). Determine the minimum sample size needed to estimate the actual market share with an accuracy of ± 5 percentage points, with 90% confidence.
Exercise 13.4 Find the minimum required sample size to estimate the average number of branded shirts sold daily. The accuracy should be ±10 units, with a confidence level of 0.9. Additionally, it is known that the standard deviation of the number of shirts sold daily does not exceed 50 units.
Exercise 13.5 As part of an experiment, we randomly select points on Earth's surface and use them to estimate the proportion of land, constructing 90% confidence intervals. We know that the true value is 0.29 (29% of Earth's surface is land). What sample size is needed to estimate this proportion with an accuracy of ±2 percentage points?
Exercise 13.6 (Agresti, Franklin, and Klingenberg 2016) How large a sample is needed to estimate the average annual income of Native Americans in Onondaga County, New York, with an accuracy of 1000 USD with 99% confidence? We have no information about the standard deviation, but we guess that almost all income values fall within the range (0 USD; 120,000 USD) and the income distribution is approximately bell-shaped.