12 Confidence intervals for a proportion

12.1 Formulas

A proportion in statistics is the share of observations that meet a given condition.

Examples:

The proportion of left-handed people in a population.
The proportion of people supporting the incumbent president in the US elections.
The proportion of defective microprocessors produced by factory A.
The probability that a specific, not necessarily perfectly balanced, die will roll a six.

Because 'proportion' means something slightly different in mathematics (and in everyday language) than in statistics, some textbooks use other terms, such as 'fraction' or 'structure index'. Instead of confidence intervals and tests for a proportion, they speak of intervals and tests for a fraction or for a structure index.

A proportion multiplied by 100 is a percentage. Therefore, some textbooks use the term 'confidence intervals/tests for percentages'.

We can determine the confidence interval for a proportion using the following formula:

\[\begin{equation} \hat{p}\pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \tag{12.1} \end{equation}\]

In the above formula, \(n\) is the sample size (number of observations), \(\hat{p}\) is the sample proportion, \(\hat{q}=1-\hat{p}\), and \(z_{\alpha/2}\) is the quantile of order \(1-\alpha/2\) of the standard normal distribution (as in the analogous z-based formula for the confidence interval for a mean). The component \(\sqrt{\frac{\hat{p}\hat{q}}{n}}\) is the standard error of the sample proportion or, more precisely, an estimate of this standard error, because it uses \(\hat{p}\) in place of the unknown population proportion.
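As an illustration, formula (12.1) can be evaluated by hand in a few lines of Python. This is only a minimal sketch: the function name `simple_proportion_ci` is made up for this example, and the data (\(n = 160\), 15 favourable observations, 95% confidence) are the same as in the templates in section 12.3.

```python
from math import sqrt
from statistics import NormalDist


def simple_proportion_ci(x, n, conf=0.95):
    """Interval from formula (12.1): phat +/- z * sqrt(phat * qhat / n).

    x is the number of favourable observations, n the sample size.
    (Hypothetical helper name, used only in this sketch.)
    """
    phat = x / n                              # sample proportion
    qhat = 1 - phat
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # quantile of order 1 - alpha/2
    se = sqrt(phat * qhat / n)                # estimated standard error
    return phat - z * se, phat + z * se


lower, upper = simple_proportion_ci(15, 160, conf=0.95)
print(f"{lower:.4f}, {upper:.4f}")  # 0.0486, 0.1389
```

The result agrees with the "simple formula" interval reported by the R and Python templates.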

Conditions for applying the formula:

The sample is random and comes from the target population.
The sample is sufficiently large. For our purposes, let's assume that \(n\hat{p} \geq 15\) and \(n\hat{q}\geq 15\), i.e. the sample contains at least 15 favourable and at least 15 unfavourable observations. Sometimes these conditions are relaxed to \(n\hat{p} \geq 5\) and \(n\hat{q}\geq 5\).
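The sample-size condition is easy to check before the interval is computed. The sketch below assumes a hypothetical helper `large_sample_ok`; because \(n\hat{p}\) equals the number of favourable observations and \(n\hat{q}\) the number of unfavourable ones, the check reduces to comparing two counts. The sample of 9 out of 30 is an invented illustration, not data from the text.

```python
def large_sample_ok(x, n, threshold=15):
    """Check that n*phat >= threshold and n*qhat >= threshold.

    Since n*phat = x and n*qhat = n - x, only the two counts are needed.
    (Hypothetical helper name, used only in this sketch.)
    """
    favourable = x
    unfavourable = n - x
    return favourable >= threshold and unfavourable >= threshold


# Data used in the templates: 15 favourable observations out of 160.
print(large_sample_ok(15, 160))             # True (15 and 145)

# Invented small sample: 9 favourable observations out of 30.
print(large_sample_ok(9, 30))               # False under the stricter rule
print(large_sample_ok(9, 30, threshold=5))  # True under the relaxed rule
```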

In practice, the so-called Wilson score interval is often used instead of the simple formula above. This approach has several advantages (e.g., it does not return intervals of zero width). Its formula is fairly complex and is not presented here, but it is included in the templates (see below).
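For readers who want to see the computation anyway, the commonly quoted textbook form of the Wilson score interval (without continuity correction) can be sketched in a few lines of Python. The function name `wilson_ci` is made up for this example, and the code is not taken from the templates; it is only meant to show how the interval differs from formula (12.1).

```python
from math import sqrt
from statistics import NormalDist


def wilson_ci(x, n, conf=0.95):
    """Wilson score interval without continuity correction.

    (Hypothetical helper name; standard textbook form of the formula.)
    """
    phat = x / n
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    denom = 1 + z**2 / n
    # The centre is pulled from phat towards 0.5.
    centre = (phat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * sqrt(phat * (1 - phat) / n + z**2 / (4 * n**2))
    return centre - half_width, centre + half_width


lower, upper = wilson_ci(15, 160)
print(f"{lower:.4f}, {upper:.4f}")  # 0.0576, 0.1489

# With no favourable observations the simple formula has zero width,
# while the Wilson interval still has a positive upper limit.
lower0, upper0 = wilson_ci(0, 20)
print(upper0 > 0)                   # True
```

For 15 favourable observations out of 160, the result matches the Wilson interval reported by the templates in section 12.3.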

12.2 Links

Confidence interval for a proportion – visualisation: https://istats.shinyapps.io/Inference_prop/

12.3 Templates

Spreadsheets

Confidence interval for a proportion — Google spreadsheet

Confidence interval for a proportion — Excel template

R code

# Confidence interval for a proportion
# Data:
# Sample size:
n <- 160
# Number of favourable observations:
x <- 15
# Sample proportion:
phat <- x/n
# Confidence level:
conf <- 0.95

# Simple formula:
alpha <- 1 - conf
resa <- phat + c(-qnorm(1-alpha/2), qnorm(1-alpha/2)) * sqrt((1/n)*phat*(1-phat))
# Wilson score:
resw <- prop.test(x, n, conf.level = 1-alpha, correct = FALSE)$conf.int

print(paste(
  list(
    "Confidence interval - simple formula:",
    resa, 
    "Confidence interval - Wilson score:", 
    resw)))

## [1] "Confidence interval - simple formula:"    "c(0.0485854437380776, 0.138914556261922)"
## [3] "Confidence interval - Wilson score:"      "c(0.0576380069455474, 0.148912026631301)"

# Using the binom package
# Sample size:
n <- 160
# Number of favourable observations:
x <- 15
# Confidence level:
conf <- 0.95
# methods = "all" returns the intervals from every available method;
# to obtain only the "simple formula" interval, use methods = "asymptotic"
binom::binom.confint(x, n, conf.level = conf, methods = "all")

##           method  x   n       mean      lower     upper
## 1  agresti-coull 15 160 0.09375000 0.05667743 0.1498726
## 2     asymptotic 15 160 0.09375000 0.04858544 0.1389146
## 3          bayes 15 160 0.09627329 0.05301161 0.1424125
## 4        cloglog 15 160 0.09375000 0.05494601 0.1449700
## 5          exact 15 160 0.09375000 0.05342512 0.1499102
## 6          logit 15 160 0.09375000 0.05730929 0.1496827
## 7         probit 15 160 0.09375000 0.05616008 0.1472798
## 8        profile 15 160 0.09375000 0.05506974 0.1453215
## 9            lrt 15 160 0.09375000 0.05506409 0.1453210
## 10     prop.test 15 160 0.09375000 0.05523020 0.1525939
## 11        wilson 15 160 0.09375000 0.05763801 0.1489120

Python code

# Confidence interval for a proportion
# Data:
# Sample size:
n = 160
# Number of favourable observations:
x = 15
# Sample proportion:
phat = x/n
# Confidence level:
conf = 0.95

from statsmodels.stats.proportion import proportion_confint
# Simple formula:
resa = proportion_confint(x, n, alpha=1-conf, method='normal')
# Wilson score:
resw = proportion_confint(x, n, alpha=1-conf, method='wilson')

print("Confidence interval - simple formula:", resa, 
"\nConfidence interval - Wilson score:", resw)

## Confidence interval - simple formula: (0.048585443738077556, 0.13891455626192245) 
## Confidence interval - Wilson score: (0.05763800694554742, 0.14891202663130057)

12.4 Exercises

Exercise 12.1 (Based on Aczel and Sounderpandian 2018) A producer of a medicinal skin cream is interested in the proportion of people of a certain age for whom the cream will improve their skin condition. In a random sample of 68 people, a positive effect was observed in 42 cases. Construct a 99% confidence interval for the proportion of people in the population who can be expected to see an improvement with this cream.

Exercise 12.2 (Based on Aczel and Sounderpandian 2018) Currently, only 1% of households use electricity generated from solar energy. Assume this result was obtained from a random sample of 8000 energy consumers. Construct a 95% confidence interval for the proportion of solar energy users in the population.

Exercise 12.3 Randomly select thirty locations on Earth and record whether each one falls on water or on land. Based on this sample, estimate the proportion of Earth's surface covered by land. Use a 90% confidence level.

Tool for random location selection: https://www.random.org/geographic-coordinates/

Literature

Aczel, A. D., and J. Sounderpandian. 2018. Statystyka w Zarządzaniu. PWN. https://ksiegarnia.pwn.pl/Statystyka-w-zarzadzaniu,731934758,p.html.