11 Confidence intervals for a mean

11.1 Confidence interval and confidence level

A confidence interval is an interval that serves as an estimate of an unknown parameter with a confidence level of \((1-\alpha)\). If the appropriate conditions are met, the probability that the confidence interval we determine will contain the true parameter is \((1-\alpha)\).

The confidence level \((1-\alpha)\) is the probability, expressed as a fraction or percentage, that a randomly determined confidence interval will contain the true value of the estimated parameter.
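The "randomly determined" reading of the confidence level can be checked by simulation. The sketch below is only an illustration under stated assumptions: the population is normal, its parameters (`mu`, `sigma`), the sample size, and the number of repetitions are hypothetical values chosen for the example, and the interval is the t-based one from formula (11.2) later in this chapter. It draws many samples, builds an interval from each, and reports the share of intervals that contain the true mean.

```python
import numpy as np
from scipy import stats


def simulate_coverage(mu=50.0, sigma=10.0, n=40, conf=0.95, reps=10_000, seed=0):
    """Return the fraction of simulated t-based intervals that contain mu.

    All default values are hypothetical and serve only as an illustration.
    """
    rng = np.random.default_rng(seed)
    # Each row is one random sample of size n from a normal population.
    samples = rng.normal(mu, sigma, size=(reps, n))
    xbar = samples.mean(axis=1)
    # Estimated standard error of the mean: s / sqrt(n), with s using ddof=1.
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
    covered = (xbar - t_crit * se <= mu) & (mu <= xbar + t_crit * se)
    return covered.mean()


coverage = simulate_coverage()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

With a 95% confidence level, the reported share should be close to 0.95, although it varies slightly with the random seed.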

11.2 Interval Estimation of the Mean

In our course, we will estimate the population mean based on random samples in two ways:

- The formula using the appropriate quantile of the standard normal distribution (the "z" formula): we assume that the sample is sufficiently large (for the purposes of this course, this means⁵ \(n \geqslant 30\)). We then rely on the central limit theorem and use the quantile \(z_{\alpha/2}\) of the standard normal distribution.
- The formula using the appropriate quantile of Student's t-distribution (the "t" formula): for small samples, estimation is only possible if we can assume that the variable's distribution in the population is (at least approximately) normal. We then use Student's t-distribution and the quantile \(t_{\alpha/2}\).

In the first case (z), we apply the following formula:

\[\begin{equation} \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}, \tag{11.1} \end{equation}\]

where \(\bar{x}\) is the sample mean, \(s\) is the sample standard deviation, \(n\) is the number of observations (sample size), and \(z_{\alpha/2}\) is the quantile of the standard normal distribution such that \(P(Z > z_{\alpha/2})= \alpha/2\). The component \(\frac{s}{\sqrt{n}}\) is called the standard error or, more precisely, the estimate of the standard error of the sample mean.
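As a quick worked illustration of formula (11.1), take hypothetical values (not drawn from any data set in this chapter): \(n = 64\), \(\bar{x} = 50\), \(s = 8\), and a 95% confidence level, so that \(\alpha = 0.05\) and \(z_{0.025} \approx 1.96\). Then

\[\begin{aligned} \bar{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} &= 50 \pm 1.96 \cdot \frac{8}{\sqrt{64}} \\ &= 50 \pm 1.96 \cdot 1 \\ &= 50 \pm 1.96, \end{aligned}\]

which gives the interval \((48.04,\ 51.96)\). Here the estimated standard error \(s/\sqrt{n}\) happens to equal 1, so the margin of error is the quantile itself.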

In the second case (t), we apply the following formula:

\[\begin{equation} \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}}, \tag{11.2} \end{equation}\]

where \(t_{\alpha/2}\) is the quantile of Student's t-distribution such that \(P(T > t_{\alpha/2}) = \alpha/2\), and the remaining symbols are the same as in formula (11.1).

Student's t-distribution has a single parameter, the number of degrees of freedom (denoted \(\nu\) or d.f.). For confidence interval estimation of the mean, we take \(\nu = n-1\).
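A small worked illustration of formula (11.2), again with hypothetical values and assuming the population is approximately normal: \(n = 16\), \(\bar{x} = 50\), \(s = 8\), and a 95% confidence level. The degrees of freedom are \(\nu = 16 - 1 = 15\), and the corresponding quantile is \(t_{0.025} \approx 2.131\). Then

\[\begin{aligned} \bar{x} \pm t_{\alpha/2} \frac{s}{\sqrt{n}} &= 50 \pm 2.131 \cdot \frac{8}{\sqrt{16}} \\ &= 50 \pm 2.131 \cdot 2 \\ &= 50 \pm 4.262, \end{aligned}\]

which gives the interval \((45.74,\ 54.26)\) after rounding to two decimal places.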

11.3 Additional Notes

For large samples (large \(n\)), the confidence interval based on the \(t\) statistic is very close to the one based on the \(z\) statistic, so in practice the \(t\) formula is usually used even for large samples.
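How close the two formulas become can be seen by comparing the quantiles directly, since the intervals differ only in the quantile that multiplies the standard error. The sketch below is an illustration only: the list of sample sizes and the 95% confidence level are arbitrary choices made for this example.

```python
from scipy import stats

conf = 0.95
alpha = 1 - conf

# Quantile of the standard normal distribution used in the "z" formula.
z_crit = stats.norm.ppf(1 - alpha / 2)

# Quantiles of Student's t-distribution with n - 1 degrees of freedom
# for a few illustrative (arbitrarily chosen) sample sizes.
sample_sizes = [5, 10, 30, 100, 1000]
t_crits = {n: stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes}

print(f"z quantile: {z_crit:.4f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:>4}: t quantile = {t_crit:.4f}, ratio t/z = {t_crit / z_crit:.4f}")
```

The t quantile is always larger than the z quantile, so the t-based interval is always somewhat wider, but the ratio approaches 1 as \(n\) grows.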
The "z" formula can also be used when the sample is small, provided that both of the following conditions are met:
- The population distribution is approximately normal, and
- The population standard deviation is known.

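The second condition is rarely met in practice. When it is met, the known population standard deviation \(\sigma\) is used in place of \(s\) in formula (11.1).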
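The small-sample use of the "z" formula can be sketched as follows. This is a minimal illustration rather than part of the course templates: the function name `z_interval_known_sigma` and all numeric values are hypothetical, and the code assumes that the population is approximately normal and that its standard deviation `sigma` is known, so that it takes the place of \(s\) in formula (11.1).

```python
import math
from scipy import stats


def z_interval_known_sigma(xbar, sigma, n, conf=0.95):
    """z-based interval for the mean when the population sigma is known.

    Hypothetical helper; assumes an approximately normal population.
    """
    alpha = 1 - conf
    z_crit = stats.norm.ppf(1 - alpha / 2)
    margin = z_crit * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


# Illustrative values only: a small sample (n = 16) with known sigma = 8.
low, high = z_interval_known_sigma(xbar=50.0, sigma=8.0, n=16)
print(f"95% CI with known sigma: ({low:.2f}, {high:.2f})")
```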

11.4 Links

Confidence Intervals – Visualization 1: https://rpsychologist.com/d3/ci/

Confidence Intervals – Visualization 2: https://seeing-theory.brown.edu/frequentist-inference/index.html#section2

Confidence Intervals – Visualization 3: https://college.cengage.com/nextbook/statistics/utts_13540/student/html/simulation5_2.html

Confidence Interval for the Mean – Visualization 4: https://istats.shinyapps.io/Inference_mean/

11.5 Templates

Spreadsheets

Confidence interval for the mean — Google spreadsheet

Confidence interval for the mean — Excel template

R code

# Confidence interval for the mean
# Data:
# Sample size:
n <- 24
# Sample mean:
xbar <- 183
# Population/sample standard deviation:
s <- 5.19
# Confidence level:
conf <- 0.95

alpha <- 1 - conf

# z:
ci_z <- xbar + c(-qnorm(1-alpha/2), qnorm(1-alpha/2)) * s/sqrt(n)

# t:
df<- n-1
ci_t <- xbar + c(-qt(1-alpha/2, df), qt(1-alpha/2, df)) * s/sqrt(n)

print(paste(
  list(
    "Confidence interval - z:",
    ci_z, 
    "Confidence interval - t:", 
    ci_t)))

## [1] "Confidence interval - z:"              "c(180.923605699976, 185.076394300024)" "Confidence interval - t:"             
## [4] "c(180.808455203843, 185.191544796157)"

# Using raw data (the vector name "dane" is Polish for "data"):
dane <- c(34.1, 35.6, 34.2, 33.9, 25.1)
test_result <- t.test(dane, conf.level = 0.99)
print(test_result$conf.int)

## [1] 23.85964 41.30036
## attr(,"conf.level")
## [1] 0.99

Python code

import math
import scipy.stats as stats

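# Sample size:
n = 24
# Sample mean:
xbar = 183
# Population/sample standard deviation:
s = 5.19
# Confidence level:
conf = 0.95
alpha = 1 - conf

# z: xbar -/+ z quantile * standard error
ci_z = [xbar + (-stats.norm.ppf(1-alpha/2)) * s/math.sqrt(n), xbar + (stats.norm.ppf(1-alpha/2)) * s/math.sqrt(n)]

# t: xbar -/+ t quantile (n - 1 degrees of freedom) * standard error
df = n-1
ci_t = [xbar + (-stats.t.ppf(1-alpha/2, df)) * s/math.sqrt(n), xbar + (stats.t.ppf(1-alpha/2, df)) * s/math.sqrt(n)]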

print("Confidence interval - z:", ci_z,
"\nConfidence interval - t:", ci_t)

## Confidence interval - z: [180.92360569997632, 185.07639430002368] 
## Confidence interval - t: [180.80845520384258, 185.19154479615742]

# Version 2

print(stats.norm.interval(confidence=conf, loc=xbar, scale=s/math.sqrt(n)), "\n",
stats.t.interval(confidence=conf, df=df, loc=xbar, scale=s/math.sqrt(n)))

## (180.92360569997632, 185.07639430002368) 
##  (180.80845520384258, 185.19154479615742)

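# Using raw data:
import numpy as np
from scipy import stats

dane = np.array([34.1, 35.6, 34.2, 33.9, 25.1])
# ttest_1samp requires a hypothesized mean (popmean); the confidence interval
# for the mean does not depend on it, so the sample mean is passed as a placeholder.
# The confidence_interval method is available in recent SciPy versions.
test_result = stats.ttest_1samp(dane, popmean=np.mean(dane))
conf_int = test_result.confidence_interval(0.99)
print(conf_int)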

## ConfidenceInterval(low=23.85964498330358, high=41.300355016696415)

11.6 Questions

Question 11.1 Fill in the blanks:

Other things being equal, the ______ the sample size, the narrower the confidence interval.
Other things being equal, the ______ the confidence level, the wider the confidence interval.
Other things being equal, the ______ the value of \(\alpha\), the wider the confidence interval.
Other things being equal, the ______ the sample standard deviation, the wider the confidence interval.
Other things being equal, on average, the ______ the population standard deviation, the wider the confidence interval.

11.7 Exercises

Exercise 11.1 A study examined the calorie content of a standard breakfast served at a university cafeteria. Based on a random sample of 100 breakfasts, the average calorie count was found to be 321, with a standard deviation of 24 calories. What is the 90% confidence interval for the mean calorie count?

Exercise 11.2 (Based on Aczel and Sounderpandian 2018) A mining company wants to estimate the average amount of ore per ton of extracted copper ore. A random sample of 50 tons was selected, each representing a single observation. The sample mean was 66.75 kg, with a standard deviation of 15.20 kg. Determine the confidence interval for the average ore content per ton at confidence levels of 95%, 90%, and 99%.

Exercise 11.3 (Based on Aczel and Sounderpandian 2018) A battery manufacturer wants to estimate the average battery life in small electronic devices. A sample of 12 batteries was taken, yielding a mean of \(\bar{x}\) = 34.2 hours and a standard deviation of \(s\) = 5.9 hours. Determine the 95% confidence interval for the mean battery life. What assumptions must be made?

Exercise 11.4 An HR advisory company wants to estimate the average salary of individuals in managerial positions in banking. A random sample of 50 managers was taken, yielding \(\bar{x}\) = 22,539 zł and \(s\) = 8,790 zł. Provide a 90-percent confidence interval for the average salary of managers in banking.

Exercise 11.5 (Based on Aczel and Sounderpandian 2018) An art dealer wants to estimate the average value of artworks of a specific type from a certain period. To do this, they obtained a 20-element sample and calculated the mean (5,139 zł) and the standard deviation (640 zł). What is the 95-percent confidence interval for the average value of such artworks from the studied period? What assumptions need to be made?

Exercise 11.6 Based on the sample data, determine the 99-percent confidence intervals for the average travel time of Dr. B by bicycle on the route between the university and home:

- for any trip,
- for the trip to the university,
- for the return trip from the university.

Literature

Aczel, A. D., and J. Sounderpandian. 2018. Statystyka w Zarządzaniu. PWN. https://ksiegarnia.pwn.pl/Statystyka-w-zarzadzaniu,731934758,p.html.

⁵ In real-life applications, it is advisable to consult a statistician or econometrician, as improper application of any technique may lead to incorrect results.