14 Hypothesis testing — one mean

Discussing statistical tests requires the introduction of specific terms and the presentation of various topics. Therefore, in this and subsequent chapters, alongside the introduction of specific statistical tests, individual terms will be presented and specific issues discussed: hypotheses (14.1), test statistics and rejection region (14.2), Type I and Type II errors (14.3), statements regarding the non-rejection of the null hypothesis (14.4), the structure of a statistical test (14.5), types of alternative hypotheses (14.6), notation of null hypotheses with non-strict inequalities (14.7), p-value (15.2), power of a test (15.3), the relationship between confidence intervals and tests (16.6), and effect size (16.7).

14.1 Hypotheses and statistical tests

A statistical hypothesis is a hypothesis about the numerical value of a population/process parameter.

Sometimes hypotheses may also refer to the distributions of variables in the population/process.

The null hypothesis (\(H_0\)) is a statistical hypothesis that one accepts until the data provide reasonable evidence to conclude otherwise.

The null hypothesis may reflect the

  • ‘status quo’,

  • some specific value of a population parameter that we want to test,

  • an assumption of no effect (e.g. ‘correlation is zero’).

The alternative hypothesis (\(H_A\)) is a hypothesis that we will only accept if the data provide sufficient evidence that it may be true.

The alternative hypothesis may reflect, for example, a situation in which we conclude that there is some effect or some relationship in the data (e.g. ‘the correlation is different from zero’, ‘the correlation is greater than zero’).

There are two possible results of a null-hypothesis statistical test:

  • the rejection of the null hypothesis in favour of the alternative hypothesis,

  • or the conclusion that there are insufficient grounds to reject the null hypothesis.

14.2 Test statistic and rejection region

A test statistic is a statistic calculated from a sample to decide whether or not to reject the null hypothesis. A test statistic often takes its symbol from its distribution, e.g. \(z\) (standard normal distribution), \(t\) (Student's \(t\) distribution), and \(\chi^2\) (chi-square distribution).

The rejection region of a statistical test consists of the values of the test statistic that lead the researcher to reject the null hypothesis in favour of the alternative hypothesis. The rejection region is also referred to as the critical region or critical set.

14.3 Errors

A statistical test will not always give the correct result. When deciding on the basis of a sample whether or not to reject the null hypothesis, we may (usually unknowingly) make one of two types of errors:

Type I error occurs when the null hypothesis is rejected, even though it is actually true. In other words, a Type I error involves accepting a false alternative hypothesis.

Type II error occurs when the null hypothesis is not rejected, even though it is actually false.

The probability of making a Type I error is often denoted by the Greek letter alpha (\(\alpha\)), while the probability of making a Type II error is represented by the letter \(\beta\).
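Both error probabilities can be illustrated with a small simulation. The sketch below is only an illustration, not part of the course templates: it assumes a two-tailed \(z\) test with known \(\sigma\) (described in Section 14.8), and the values `mu0 = 100`, `sigma = 15`, `n = 25` and the "true" mean of 107.5 are hypothetical. The function name `rejection_rate` is likewise made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, mu0=100.0, sigma=15.0, n=25,
                   alpha=0.05, n_sim=4000, seed=1):
    """Share of simulated samples in which a two-tailed z test rejects H0: mu = mu0."""
    rng = random.Random(seed)                      # fixed seed for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    rejections = 0
    for _ in range(n_sim):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / (sigma / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


# H0 is true: every rejection is a Type I error, so the rate estimates alpha.
type_1_rate = rejection_rate(true_mean=100.0)

# H0 is false: every non-rejection is a Type II error, so 1 - rate estimates beta.
type_2_rate = 1 - rejection_rate(true_mean=107.5)

print(f"Estimated P(Type I error):  {type_1_rate:.3f}")
print(f"Estimated P(Type II error): {type_2_rate:.3f}")
```

The first estimate should be close to the chosen \(\alpha\), whatever the other settings. The second depends on how far the true mean lies from the null value, which is one reason why \(\beta\) is usually unknown in practice.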

14.4 Do we accept the null hypothesis?

Note! In statistics, we typically use cautious language when discussing hypotheses. It is important to remember not to speak of ‘accepting \(H_0\)’ based on test results. Since we usually do not have enough information to determine the probability of a Type II error (\(\beta\)), we carefully state that if the test statistic does not fall within the rejection region, we conclude that there is no basis (or: no sufficient evidence) to reject the null hypothesis.

14.5 Elements of a typical statistical test

  1. Null hypothesis (\(H_0\)) – A statement specifying a particular value of a parameter that characterizes one or more populations. The null hypothesis usually describes the ‘status quo,’ meaning it represents what we assume to be true until the data provide evidence to the contrary. The null hypothesis always includes an equality sign, i.e. \(H_0: \text{parameter}=\text{value}\). The \(\text{value}\) in the null hypothesis is sometimes referred to as the null value.

  2. Alternative hypothesis (\(H_A\)) – The hypothesis that we accept only if the obtained data provide sufficient evidence to reject the null hypothesis. The alternative hypothesis contains an inequality, which may take one of the following forms: \(H_A: \text{parameter}\ne\text{value}\), or: \(H_A: \text{parameter}>\text{value}\), or: \(H_A: \text{parameter}<\text{value}\). The alternative hypothesis contains the same null value as the null hypothesis.

  3. Significance Level (\(\alpha\)) – The acceptable probability of rejecting the null hypothesis when it is actually true (i.e., the conditional probability of making a Type I error). A commonly used standard value is \(\alpha = 0.05\).

\[ \alpha = P (\text{rejection of }H_0\:|\:H_0 \text{ is true}) \]

  4. Test statistic – A statistic calculated from the sample, which helps determine whether there is sufficient evidence to reject the null hypothesis.

  5. Rejection region – Also known as the critical region or critical set, this is the set of test statistic values that lead to rejecting the null hypothesis in favour of the alternative hypothesis. The critical region is determined based on the distribution of the test statistic, the chosen significance level \(\alpha\), and the form of the alternative hypothesis. The boundary of the rejection region is often referred to as the critical value.

  6. Assumptions and conditions – Tests assume that the collected samples come from the studied population. Typical tests also assume that the samples have been drawn completely at random. Some tests have additional assumptions, such as requirements about the distribution of the studied variable within the population. For many tests, it is also important to ensure an adequate sample size.

  7. Conducting the test – Collecting a sample and obtaining the relevant sample statistics.

  8. Conclusion – Depending on the value of the test statistic, the statistical test leads to:

     1. rejecting the null hypothesis in favour of the alternative hypothesis – if the test statistic falls within the rejection region, or

     2. stating that there is insufficient evidence to reject the null hypothesis.

If we reject the null hypothesis, we say or write: "At a significance level of \(\alpha\) (insert appropriate value), we reject the null hypothesis in favour of the alternative hypothesis."

If we do not reject the null hypothesis, we state: "The sample does not provide sufficient evidence to reject the null hypothesis at a significance level of \(\alpha\)."

14.6 Types of alternatives

Depending on the context, the alternative hypothesis in many tests (e.g. in tests for one and two means, in tests for one or two proportions) can take three different forms:

  • two-sided/two-tailed (\(H_A: \text{parameter } \ne \text{value}\)),

  • left-sided/left-tailed (\(H_A: \text{parameter } < \text{value}\))

  • right-sided/right-tailed (\(H_A: \text{parameter } > \text{value}\))

The form of the hypothesis depends on what we are investigating and why. This is best discussed with examples of specific tasks being solved. It is important to note that decisions about the form of an alternative hypothesis cannot be made on the basis of sample results. Ideally, hypotheses should be established before the data are obtained, or as if the specific data have not yet been seen.

Tests in which the alternative hypothesis is left- or right-tailed are called one-tailed (or: one-sided) tests. A two-tailed hypothesis can also be referred to as a two-sided hypothesis.

The rejection region depends on whether the hypothesis is two-sided, right-tailed, or left-tailed. As we will see, if the alternative hypothesis is right- or left-tailed, the critical region is also right- or left-tailed. If the alternative hypothesis is two-tailed, the critical region is also two-sided.
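As an illustration, the correspondence between the form of the alternative hypothesis and the rejection region can be sketched for a test statistic with a standard normal distribution. This is a minimal sketch using only the Python standard library. The function name `z_rejection_region` and the labels `"two-sided"`, `"less"` and `"greater"` are choices made for this example.

```python
from statistics import NormalDist


def z_rejection_region(alpha, alternative):
    """Return (lower, upper) bounds: H0 is rejected when z < lower or z > upper."""
    std_normal = NormalDist()  # standard normal distribution
    if alternative == "two-sided":
        crit = std_normal.inv_cdf(1 - alpha / 2)   # alpha is split between both tails
        return (-crit, crit)
    if alternative == "less":
        return (std_normal.inv_cdf(alpha), float("inf"))        # left tail only
    if alternative == "greater":
        return (float("-inf"), std_normal.inv_cdf(1 - alpha))   # right tail only
    raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")


for alt in ("two-sided", "less", "greater"):
    lower, upper = z_rejection_region(0.05, alt)
    print(f"{alt:>9}: reject H0 if z < {lower:.3f} or z > {upper:.3f}")
```

An infinite bound means that the corresponding tail does not belong to the rejection region. Note that at the same \(\alpha\), the one-tailed critical value is closer to zero than the two-tailed one, because the whole of \(\alpha\) is placed in a single tail.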

For some tests, such as the chi-squared test or analysis of variance (ANOVA), the terms “one/two-sided” are not used for the alternative hypothesis. This is because — as can be seen when studying these tests — they are rather “multi-sided” in nature.

14.7 Notation of null hypotheses with non-strict inequalities

Note! In the case of one-tailed tests, the null hypothesis is sometimes written using a non-strict inequality. That is, if the alternative hypothesis is:

\[H_A: \text{parameter } > \text{value},\]

then the null hypothesis can be written as:

\[H_0: \text{parameter } \leq \text{value}\]

Conversely, if \(H_A: \text{parameter } < \text{value}\), then \(H_0: \text{parameter } \geq \text{value}\). This notation, which includes a non-strict inequality, follows a certain logic — let us call it the “decision-based notation”.

The notation using an equality sign: \[H_0: \text{parameter } = \text{value}\] can be referred to as “point notation”.

In these course notes, we use point notation. A null hypothesis with an equality sign (point notation) is applied, for example, when defining the significance level, determining and defining the p-value, and establishing the rejection region.⁶

The null hypothesis written in the “decision-based” form highlights that if we do not accept the one-tailed alternative hypothesis, we are left with the remaining part of the number line, which can be described using a non-strict inequality.
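One way to see why the two notations lead to the same test is to check how often a right-tailed test rejects \(H_0\) for different true means that satisfy \(H_0: \mu \leq \mu_0\). The sketch below assumes a \(z\) test for one mean with known \(\sigma\) (Section 14.8). The values \(\mu_0 = 100\), \(\sigma = 15\) and \(n = 36\) are hypothetical, and the function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def prob_reject_right_tailed(true_mean, mu0, sigma, n, alpha=0.05):
    """P(z > z_alpha) for a right-tailed z test when the population mean is true_mean."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)
    shift = (true_mean - mu0) / (sigma / sqrt(n))
    # The statistic z is distributed as N(shift, 1),
    # so P(z > z_alpha) = 1 - Phi(z_alpha - shift).
    return 1 - std_normal.cdf(z_alpha - shift)


mu0, sigma, n = 100.0, 15.0, 36   # hypothetical values
for true_mean in (94.0, 97.0, 99.0, 100.0):
    p = prob_reject_right_tailed(true_mean, mu0, sigma, n)
    print(f"true mean = {true_mean:5.1f}  ->  P(reject H0) = {p:.4f}")
```

Among all means allowed by \(H_0: \mu \leq \mu_0\), the probability of rejection is largest at the boundary \(\mu = \mu_0\), where it equals \(\alpha\). Controlling the Type I error at the point \(\mu = \mu_0\) therefore controls it for the whole “decision-based” null hypothesis.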

14.8 One-sample mean test based on the \(z\) statistic

  1. In a one-sample mean test, the null hypothesis states that the mean of the population from which the sample is drawn equals \(\mu_0\) (i.e., a specific null value defined in the hypothesis):

\[ H_0: \mu = \mu_0 \]

  2. The alternative hypothesis must be chosen from three options:
  • In a two-tailed test:

\[H_A: \mu \ne \mu_0\]

  • In a left-tailed test:

\[H_A: \mu < \mu_0 \]

  • In a right-tailed test:

\[H_A: \mu > \mu_0 \]

  3. The test statistic \(z\):

\[z = \frac{\bar{x}-\mu_0}{\sigma / \sqrt{n}}\:\:\text{ or }\:\: z\approx\frac{\bar{x}-\mu_0}{s / \sqrt{n}}, \]

where \(\bar{x}\) is the sample mean, \(\mu_0\) is the hypothesised population mean (the null value), \(\sigma\) is the population standard deviation (when known), \(s\) is the sample standard deviation, and \(n\) is the sample size.

  4. The rejection region depends on the alternative hypothesis. If the test is two-tailed, the rejection region is \(|z|>z_{\alpha/2}\). If the test is left-tailed, the rejection region is \(z < -z_{\alpha}\), while for a right-tailed test, the rejection region is \(z > z_{\alpha}\). The critical values \(z_{\alpha}\) and \(z_{\alpha/2}\) are chosen such that \(\mathbb{P}(z>z_{\alpha})=\alpha\) and \(\mathbb{P}(|z|>z_{\alpha/2})=\alpha\).

  5. It is important to note that a key assumption of the test is that the sample is drawn from the studied population (or process). Another requirement is that the sample must be sufficiently large (in practice, \(n\geq 30\)).

Note! The test can also be used for small samples if the population standard deviation \(\sigma\) is known (which is rare but sometimes occurs) and, at the same time, the population distribution is normal (or approximately normal).
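The steps of the \(z\) test, together with the standard wording of the conclusion from Section 14.5, can be put together in a short function. This is a sketch under stated assumptions rather than a reference implementation: the function name `one_sample_z_test`, the labels for the alternatives and the summary statistics in the usage example are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(xbar, mu0, sd, n, alpha=0.05, alternative="two-sided"):
    """Return (z, reject, conclusion) for a one-sample z test from summary statistics."""
    std_normal = NormalDist()
    z = (xbar - mu0) / (sd / sqrt(n))                 # test statistic
    if alternative == "two-sided":
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    elif alternative == "less":
        reject = z < -std_normal.inv_cdf(1 - alpha)
    elif alternative == "greater":
        reject = z > std_normal.inv_cdf(1 - alpha)
    else:
        raise ValueError("alternative must be 'two-sided', 'less' or 'greater'")
    if reject:
        conclusion = (f"At a significance level of {alpha}, we reject the null "
                      "hypothesis in favour of the alternative hypothesis.")
    else:
        conclusion = ("The sample does not provide sufficient evidence to reject "
                      f"the null hypothesis at a significance level of {alpha}.")
    return z, reject, conclusion


# Hypothetical summary statistics: n = 50, sample mean 52.1, sd = 6, H0: mu = 50
z, reject, conclusion = one_sample_z_test(xbar=52.1, mu0=50, sd=6, n=50)
print(f"z = {z:.3f}")
print(conclusion)
```

The argument `sd` stands for \(\sigma\) when the population standard deviation is known, or for \(s\) in the large-sample approximation.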

14.9 One-sample \(t\) test

  1. In this test, the null and alternative hypotheses can be formulated in the same way as in the \(z\) test.

  2. The test statistic \(t\):

\[t = \frac{\bar{x}-\mu_0}{s / \sqrt{n}}, \]

where \(\bar{x}\) is the sample mean, \(\mu_0\) is the hypothesised population mean (null value), \(s\) is the sample standard deviation, and \(n\) is the sample size.

  3. The rejection region depends on the form of the alternative hypothesis. If the test is two-tailed, the rejection region is \(|t|>t_{\alpha/2}\). If the test is left-tailed, the rejection region is \(t<-t_{\alpha}\), while for a right-tailed test, the rejection region is \(t>t_{\alpha}\). The critical values \(t_{\alpha}\) and \(t_{\alpha/2}\) are chosen such that \(\mathbb{P}(t>t_{\alpha})=\alpha\) and \(\mathbb{P}(|t|>t_{\alpha/2})=\alpha\).

To determine the critical values \(t_{\alpha}\) or \(t_{\alpha/2}\), the degrees of freedom must be known. In this test, the degrees of freedom (\(\nu\), d.f.) are given by \(n-1\).

  4. The \(t\) test is applied when the sample is randomly drawn and the population distribution can be assumed to be normal (or approximately normal).

14.10 Practical use of \(t\) and \(z\) tests

In the statistics course based on these notes, we use the \(t\) test for small samples when the assumption of normality of the population distribution is reasonable, while the \(z\) test is applied for large samples.

In practical statistical applications, the \(t\) test is often used even for large samples because, with a high number of degrees of freedom (large sample size \(n\)), the \(t\) test yields results similar to those of the \(z\) test. As a result, some statistical software packages do not include the \(z\) test as a standard option.
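How quickly the \(t\) test approaches the \(z\) test can be checked numerically. The sketch below compares two-sided p-values for the same, arbitrarily chosen, test statistic of 2.0. To stay within the Python standard library, it integrates the density of the \(t\) distribution with Simpson's rule. This is an illustrative shortcut and the helper names are made up for this example; in practice a statistical library (such as `scipy.stats`, used in the templates in Section 14.12) would normally be used instead.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


def t_two_sided_p(stat, df, steps=2000):
    """Two-sided p-value P(|T| > |stat|), using Simpson's rule on [0, |stat|]."""
    b = abs(stat)
    h = b / steps                          # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3           # approximates P(0 < T < |stat|)
    return 2 * (0.5 - central_half)


stat = 2.0                                 # arbitrary value of the test statistic
p_z = 2 * (1 - NormalDist().cdf(stat))
print(f"z test:               p = {p_z:.4f}")
for df in (5, 10, 30, 100, 1000):
    print(f"t test, df = {df:4d}:    p = {t_two_sided_p(stat, df):.4f}")
```

For small numbers of degrees of freedom the \(t\) distribution has heavier tails, so the same statistic gives a noticeably larger p-value. As the degrees of freedom grow, the p-values of the two tests become practically indistinguishable.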

14.12 Templates

Spreadsheets

Tests for 1 mean and 1 proportion — Google spreadsheet

Tests for 1 mean and 1 proportion — Excel template

R code

# Test for 1 mean
# Sample size:
n <- 38
# Sample mean:
xbar <- 184.21
# Sample/population standard deviation:
s <- 6.1034
# Significance level:
alpha <- 0.05

# Null value for the mean:
mu0 <- 179
# Alternative (sign): "<"; ">"; "<>"; "≠"
alt <- ">"

# Computations
# degrees of freedom:
df <- n-1

# critical value (t test):
crit_t <- if (alt == "<") {qt(alpha, df)} else if (alt == ">") {qt(1-alpha, df)} else {qt(1-alpha/2, df)}

# test statistic (t/z)
test_tz <- (xbar-mu0)/(s/sqrt(n))

# p-value (t test):
p.value <- if (alt == ">") {1-pt(test_tz, df)} else if (alt == "<") {pt(test_tz, df)} else {2*(1-pt(abs(test_tz),df))}

# critical value (z test):
crit_z <- if (alt == "<") {qnorm(alpha)} else if (alt == ">") {qnorm(1-alpha)} else {qnorm(1-alpha/2)}

# p-value (z test):
p.value.z <- if (alt == ">") {1-pnorm(test_tz)} else if (alt == "<") {pnorm(test_tz)} else {2*(1-pnorm(abs(test_tz)))}

print(c('Mean' = xbar, 
        'SD' = s,
        'Sample size' = n,
        'Null hypothesis' = paste0('mu = ', mu0),
        'Alt. hypothesis' = paste0('mu ', alt, ' ', mu0),
        'Test statistic t/z' = test_tz,
        'Critical value (t test)' = crit_t,
        'P value (t test)' = p.value,
        'Critical value (z test)' = crit_z,
        'P value (z test)' = p.value.z
))
##                    Mean                      SD             Sample size         Null hypothesis         Alt. hypothesis 
##                "184.21"                "6.1034"                    "38"              "mu = 179"              "mu > 179" 
##      Test statistic t/z Critical value (t test)        P value (t test) Critical value (z test)        P value (z test) 
##      "5.26208293008297"      "1.68709361959626"   "3.1304551380007e-06"      "1.64485362695147"  "7.12162456784071e-08"
# Based on raw data (t test):

# Data vector
data <- c(176.5267, 195.5237, 184.9741, 179.5349, 188.2120, 190.7425, 178.7593, 196.2744, 186.6965, 187.8559, 183.1323, 176.2569, 191.4752, 186.5975, 180.2120, 184.3434, 178.1691, 184.8852, 187.7973, 178.5013, 172.7343, 176.8545, 184.2068, 181.2395, 186.1983, 173.6317, 181.9529, 185.9135, 188.6081, 183.0285, 183.3375, 188.5512, 184.6348, 186.9657, 183.9622, 200.9014, 183.5353, 177.2538)

# Storing as an object
# Choose alternative: "two.sided" (default), "less", or "greater" and the null value (default is zero)
test_result <- t.test(data, alternative = "greater", mu = 179)

# Printing test results. Single components can be printed using for example test_result$statistic.
print(test_result)
## 
##  One Sample t-test
## 
## data:  data
## t = 5.2621, df = 37, p-value = 3.13e-06
## alternative hypothesis: true mean is greater than 179
## 95 percent confidence interval:
##  182.5396      Inf
## sample estimates:
## mean of x 
##    184.21

Python code

# Test for 1 mean
from scipy.stats import t, norm
from math import sqrt

# Sample size:
n = 38

# Sample mean:
xbar = 184.21

# Sample/population standard deviation:
s = 6.1034

# Significance level:
alpha = 0.05

# Null value for the population mean:
mu0 = 179

# Alternative (sign): "<"; ">"; "<>"; "≠"
alt = ">"

# Calculations:
# degrees of freedom:
df = n - 1

# critical value (t test):
if alt == "<":
    crit_t = t.ppf(alpha, df)
elif alt == ">":
    crit_t = t.ppf(1 - alpha, df)
else:
    crit_t = t.ppf(1 - alpha / 2, df)

# test statistic (t/z)
test_tz = (xbar - mu0) / (s / sqrt(n))

# p-value (t test):
if alt == ">":
    p_value_t = 1 - t.cdf(test_tz, df)
elif alt == "<":
    p_value_t = t.cdf(test_tz, df)
else:
    p_value_t = 2 * (1 - t.cdf(abs(test_tz), df))

# critical value (z test):
if alt == "<":
    crit_z = norm.ppf(alpha)
elif alt == ">":
    crit_z = norm.ppf(1 - alpha)
else:
    crit_z = norm.ppf(1 - alpha / 2)

# p-value (z test):
if alt == ">":
    p_value_z = 1 - norm.cdf(test_tz)
elif alt == "<":
    p_value_z = norm.cdf(test_tz)
else:
    p_value_z = 2 * (1 - norm.cdf(abs(test_tz)))

results = {
    'Mean': xbar,
    'SD': s,
    'Sample size': n,
    'Null hypothesis': f'mu = {mu0}',
    'Alt. hypothesis': f'mu {alt} {mu0}',
    'Test stat. t/z': test_tz,
    'Critical value t': crit_t,
    'P-value (t test)': p_value_t,
    'Critical value z': crit_z,
    'P-value (z test)': p_value_z
}

for key, value in results.items():
    print(f"{key}: {value}")
## Mean: 184.21
## SD: 6.1034
## Sample size: 38
## Null hypothesis: mu = 179
## Alt. hypothesis: mu > 179
## Test stat. t/z: 5.262082930082973
## Critical value t: 1.6870936167109876
## P-value (t test): 3.1304551380006984e-06
## Critical value z: 1.6448536269514722
## P-value (z test): 7.121624567840712e-08
# Raw data (t test):

import scipy.stats as stats

data = [176.5267, 195.5237, 184.9741, 179.5349, 188.2120, 190.7425, 178.7593, 196.2744, 186.6965, 187.8559, 183.1323, 176.2569, 191.4752, 186.5975, 180.2120, 184.3434, 178.1691, 184.8852, 187.7973, 178.5013, 172.7343, 176.8545, 184.2068, 181.2395, 186.1983, 173.6317, 181.9529, 185.9135, 188.6081, 183.0285, 183.3375, 188.5512, 184.6348, 186.9657, 183.9622, 200.9014, 183.5353, 177.2538]

test_result = stats.ttest_1samp(data, popmean=179, alternative='greater')

print(test_result)
## TtestResult(statistic=5.262096550537936, pvalue=3.1303226590428976e-06, df=37)

14.13 Exercises

Exercise 14.1 A certain mean test is based on the statistic \(z\). The null hypothesis is \(H_0: \mu = 4000\). Specify the critical region (the set of values of the test statistic \(z\) that lead to rejecting the null hypothesis) if:

  1. The significance level is \(\alpha =0.05\), and the alternative hypothesis is \(H_A: \mu < 4000\)

  2. The significance level is \(\alpha =0.10\), and the alternative hypothesis is \(H_A: \mu < 4000\)

  3. The significance level is \(\alpha =0.10\), and the alternative hypothesis is \(H_A: \mu > 4000\)

  4. The significance level is \(\alpha =0.05\), and the alternative hypothesis is \(H_A: \mu \ne 4000\)

Exercise 14.2 The critical region is defined by the given formula. Represent the critical region on a sketch of the standard normal density function. Determine whether the critical region is left-tailed, right-tailed, or two-tailed. For each point, also determine the probability of making a Type I error.

  1. \(Z \in (-\infty; -1.645)\)

  2. \(Z < -1.96\)

  3. \(Z \in (2.326; \infty)\)

  4. \(Z > 1.282\)

  5. \(Z < -1.645 \text{ or } Z > 1.645\)

  6. \(|Z|>2.576\)

Exercise 14.3 From a certain population of employees, where the standard deviation of salaries is 4500 PLN, a sample of 64 individuals was taken, and the mean salary obtained was 21,000 PLN.

  1. Conduct a hypothesis test where the null hypothesis states that \(\mu=20,000\) PLN against the alternative hypothesis that \(\mu>20,000\) PLN. Use a significance level of \(\alpha=0.05\). Interpret the test results.

  2. Test the null hypothesis that \(\mu=20,000\) PLN against the alternative hypothesis that \(\mu \ne 20,000\) PLN. Use a significance level of \(\alpha=0.05\). Interpret the test results.

  3. Compare the results obtained in the two tests above. Explain why they differ.

Exercise 14.4 A rapid-measurement thermometer was tested by performing 50 measurements and comparing the results with those of an accurate thermometer. The differences between the readings of both devices were recorded (with the appropriate sign: plus if the rapid thermometer showed a higher temperature, or minus if it showed a lower one). The mean difference was 0.12 K, and the standard deviation was 0.08 K. Do these data allow us to conclude that, in the population, the average readings of both thermometers differ from each other?

Exercise 14.5 (Aczel and Sounderpandian 2018) A machine fills 2-litre bottles with cola. A consumer rights advocate wants to test the null hypothesis that the average volume of liquid dispensed into each bottle is at least 2000 cm³. A random sample of 40 bottles filled by the machine was taken, and their liquid volume was precisely measured. The sample mean obtained was 1999.6 cm³. Based on past experiences, the population standard deviation should be assumed to be 1.3 cm³.

  1. Test the null hypothesis at a significance level of 5%.

  2. Assume that the population follows a normal distribution with the same standard deviation \(\sigma\) of 1.3 cm³. Suppose the sample had only 20 observations, and the mean was still 1999.6 cm³. Perform the test again with \(\alpha\) equal to 0.05.

  3. If there is a difference between the results of the two tests above, explain it.

Exercise 14.6 A car manufacturer wants to determine whether a new engine has better fuel consumption parameters than the previous one. The previous engines, under controlled conditions, had an average fuel consumption of 6.12 litres per 100 km. A total of 120 trials of the new engine were conducted under analogous conditions, and a mean of 5.71 litres per 100 km was obtained, with a standard deviation of 0.56 litres per 100 km. Construct a confidence interval for the mean fuel consumption of the new engine. Test the appropriate hypothesis. Assume \(\alpha=0.05\).

Exercise 14.7 The average daily revenue in a certain local shop is 7218 PLN. Recently, the owner's son decided to apply marketing knowledge acquired at university and changed the layout of goods on the shelves while also placing a large billboard outside the shop encouraging purchases. After 18 days of operating under the new rules, the daily revenue during this period averaged 8713 PLN, with a standard deviation of 1023 PLN. Can it be concluded that the observed increase in revenue was statistically significant? What assumptions must be made to conduct the appropriate test?

Exercise 14.8 Verify the hypothesis that the average height reported by male statistics students is higher than the average in the population of people their age. According to data found online, the expected average height of 20-year-old men is 180 cm, with a standard deviation of 5 cm. In a random sample of male statistics students (\(n=25\)), an average height of 185 cm and a standard deviation of 6 cm were obtained. Assume a significance level of \(\alpha = 0.05\).

Exercise 14.9 (Aczel and Sounderpandian 2018) The Ognivex company has changed its lithium-ion battery production process. Batteries produced under the old process had an average lifespan of 102.5 hours. To determine whether the new process affects the average battery lifespan, the manufacturer collected a random sample of 25 batteries produced under the new process and used them until depletion. The mean battery life in the sample was 107 hours, with a standard deviation of 10 hours. Are these results significant at a significance level of \(\alpha = 0.05\)? Are they significant if we assume a significance level of \(\alpha = 0.01\)?

Literature

Aczel, A. D., and J. Sounderpandian. 2018. Statystyka w Zarządzaniu. PWN. https://ksiegarnia.pwn.pl/Statystyka-w-zarzadzaniu,731934758,p.html.

  6. For instance, when we state that the significance level is the acceptable probability of rejecting the null hypothesis when it is true, we are referring to the point null hypothesis.