19 Analysis of variance (ANOVA)

19.1 One-way analysis of variance – test

One-way analysis of variance can be considered a test in which we test the equality of means in two or more populations. It is an extension of the t test for two means (16), in the version where we assume equal variances.

The null hypothesis in analysis of variance states that the means in all the populations being compared (\(r\) populations) are equal:

\[ H_0: \mu_1 = \mu_2 = ... = \mu_r \]

The alternative hypothesis states that not all means are equal.
If the null hypothesis is true, the test statistic follows an F-distribution. Conceptually, the test statistic is the ratio of the variation between the group means to the variation within the groups (the precise formula is given in section 19.2).

\[ F = \frac{\text{Variation between groups}}{\text{Variation within groups}} \]

Critical region

The critical region in the test is right-tailed. As in the chi-square test, the alternative hypothesis can be said to be "multi-sided."
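As a rough illustration of the right-tailed critical region, the sketch below finds the critical value with `scipy.stats.f.ppf` and compares it with an observed F value. The numbers are taken from the example output later in this chapter (four groups, 28 observations, F ≈ 3.204); the significance level of 0.05 and the variable names are assumptions made for this illustration. The degrees of freedom \(r-1\) and \(n-r\) are explained in section 19.2.3.

```python
from scipy import stats

alpha = 0.05        # significance level (assumed for this illustration)
df_between = 3      # r - 1, with r = 4 groups
df_within = 24      # n - r, with n = 28 observations
f_observed = 3.204  # F value from the example output later in this chapter

# Right-tailed test: H0 is rejected when the observed F exceeds the critical value
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)
reject_h0 = f_observed > f_critical

print(f"critical value: {f_critical:.3f}")
print(f"reject H0: {reject_h0}")
```

Only large values of F count against the null hypothesis, which is why a single upper quantile is enough, whichever group means differ.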

Assumptions

ANOVA comes with the following assumptions:

the samples are random and independent,
the observed variable in each population is (approximately) normally distributed,
the means in each population may differ (equal if \(H_0\) is true), but variances in each group are the same (\(\sigma_1^2 = \sigma_2^2 = ... = \sigma_r^2\)).

In practice, ANOVA with the F test is often considered acceptable if the data are not extremely asymmetric (no extreme outliers), the observed standard deviations do not differ substantially (e.g., the largest is less than twice the smallest), and the samples are sufficiently large.
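The rule of thumb for standard deviations can be checked in a few lines. The sketch below is only an illustration: it uses the example data from the code later in this chapter, and the names `groups`, `sd_ratio` and `rule_ok` are made up here.

```python
from statistics import stdev

# Example data from the code later in this chapter
groups = {
    "A": [51, 87, 50, 48, 79, 61, 53],
    "B": [82, 91, 92, 80, 52, 79, 73],
    "C": [79, 84, 74, 98, 63, 83, 85],
    "D": [85, 80, 65, 71, 67, 51, 80],
}

# Sample standard deviation in each group
sds = {name: stdev(values) for name, values in groups.items()}

# Rule of thumb: the largest SD should be less than twice the smallest
sd_ratio = max(sds.values()) / min(sds.values())
rule_ok = sd_ratio < 2

print(f"largest SD / smallest SD = {sd_ratio:.2f}, rule satisfied: {rule_ok}")
```

This check is informal and does not replace a formal test of homogeneity of variances such as the one described in section 19.6.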

If there are suspicions that the assumptions are not met, even approximately, a nonparametric Kruskal-Wallis test, described in section 20, may be applied.

19.2 F statistic – formula

The test statistic is based on the observation that total variability (calculated as the sum of squared deviations from the overall mean) can be broken down into within-group variability (the sum of squared deviations from group means) and between-group variability (the sum of squared deviations of group means from the overall mean):

\[ SS_{Total} = SSTR + SSE \]
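The decomposition can be verified numerically. The following sketch, assuming NumPy is available, computes the three sums of squares for the example data used later in this chapter and checks that the identity holds. The variable names are chosen for this illustration only.

```python
import numpy as np

# Example data from the code later in this chapter (groups A, B, C, D)
groups = [
    np.array([51, 87, 50, 48, 79, 61, 53]),
    np.array([82, 91, 92, 80, 52, 79, 73]),
    np.array([79, 84, 74, 98, 63, 83, 85]),
    np.array([85, 80, 65, 71, 67, 51, 80]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: squared deviations of all observations from the overall mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between groups: squared deviations of group means from the overall mean,
# weighted by group size
sstr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within groups: squared deviations of observations from their own group mean
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

print(f"SS_Total = {ss_total:.2f}")
print(f"SSTR + SSE = {sstr:.2f} + {sse:.2f} = {sstr + sse:.2f}")
```

The values of SSTR and SSE should agree with the "Sum Sq" column of the ANOVA table shown in the example output later in this chapter.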

19.2.1 Between-group variability

The between-group variability (sum of squares for treatments) is denoted by SSTR (SS – sum of squares, TR – treatment, referring to the individual groups¹¹). Other notations for between-group variability include SST, SS Groups, \(SS_{\text{between}}\), etc.

Formula:

\[ SSTR = \sum_{i=1}^r n_i (\bar{x}_i-\bar{x})^2 \tag{19.1} \] where \(r\) is the number of groups, \(n_i\) is the sample size of the \(i\)-th group, \(\bar{x}_i\) is the mean of the \(i\)-th group, and \(\bar{x}\) is the overall mean.

The mean between-group variability (MSTR, mean sum of squares due to treatment) is SSTR divided by \(r-1\) (\(r\) is the number of groups):

\[ MSTR = \frac{SSTR}{r-1} \tag{19.2} \]

19.2.2 Within-group variability

The within-group variability is measured using the sum of squares for errors (SSE):

\[ SSE = \sum_{j=1}^{n_1}(x_{1j}-\bar{x}_1)^2 + \sum_{j=1}^{n_2}(x_{2j}-\bar{x}_2)^2+ \dots + \sum_{j=1}^{n_r}(x_{rj}-\bar{x}_r)^2 \tag{19.3} \]

where \(n_1\), \(n_2\), ..., \(n_r\) are the sample sizes of the groups, \(x_{ij}\) is the \(j\)-th observation in the \(i\)-th group, and \(\bar{x}_1\), \(\bar{x}_2\), ..., \(\bar{x}_r\) are the observed group means.

The mean square error (MSE, sometimes called error variance) is SSE divided by \(n-r\), where \(n = n_1 + n_2 + \dots + n_r\) is the total number of observations:

\[ MSE = \frac{SSE}{n-r} \tag{19.4} \]

It can be observed that:

\[ SSE = (n_1-1)s_1^2 + (n_2-1)s_2^2 + \dots + (n_r-1)s_r^2 \tag{19.5} \]

where \(s_i^2\) is the sample variance in the \(i\)-th group.

Thus, MSE is the pooled variance (as in the two-mean test: 16.3).
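To illustrate that formulas (19.3) and (19.5) give the same result, the sketch below computes SSE both ways with Python's standard `statistics` module, using the example data from later in this chapter. The names `sse_direct` and `sse_from_variances` are made up for this illustration.

```python
from statistics import mean, variance

# Example data from the code later in this chapter (groups A, B, C, D)
groups = [
    [51, 87, 50, 48, 79, 61, 53],
    [82, 91, 92, 80, 52, 79, 73],
    [79, 84, 74, 98, 63, 83, 85],
    [85, 80, 65, 71, 67, 51, 80],
]

n = sum(len(g) for g in groups)  # total number of observations
r = len(groups)                  # number of groups

# Formula (19.3): squared deviations from each group's own mean
sse_direct = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)

# Formula (19.5): (n_i - 1) times the sample variance of each group
sse_from_variances = sum((len(g) - 1) * variance(g) for g in groups)

# Formula (19.4): MSE, i.e. the pooled variance
mse = sse_direct / (n - r)

print(f"SSE (19.3) = {sse_direct:.3f}")
print(f"SSE (19.5) = {sse_from_variances:.3f}")
print(f"MSE = {mse:.3f}")
```

Because all groups in this example have the same size, the pooled variance here is simply the average of the four group variances. With unequal group sizes, larger groups receive more weight.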

19.2.3 F Statistic

The F statistic is the ratio of the mean between-group variability to the pooled variance:

\[ F_{(r-1,n-r)}=\frac{MSTR}{MSE} \tag{19.6} \]

The F statistic follows an F-distribution (Fisher-Snedecor distribution) with degrees of freedom \(df_1 = r-1\) and \(df_2 = n-r\).
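The sketch below puts formulas (19.1)–(19.6) together for the example data used later in this chapter and obtains the p-value as the right-tail probability of the F-distribution. It assumes NumPy and SciPy are available; `scipy.stats.f_oneway` is used only as a cross-check, and the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

# Example data from the code later in this chapter (groups A, B, C, D)
groups = [
    np.array([51, 87, 50, 48, 79, 61, 53]),
    np.array([82, 91, 92, 80, 52, 79, 73]),
    np.array([79, 84, 74, 98, 63, 83, 85]),
    np.array([85, 80, 65, 71, 67, 51, 80]),
]

r = len(groups)                  # number of groups
n = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Mean squares, formulas (19.2) and (19.4)
mstr = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (r - 1)
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - r)

# F statistic (19.6) and right-tailed p-value with df1 = r - 1, df2 = n - r
f_stat = mstr / mse
p_value = stats.f.sf(f_stat, r - 1, n - r)

# Cross-check against SciPy's one-way ANOVA
reference = stats.f_oneway(*groups)

print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
print(f"scipy.stats.f_oneway: F = {reference.statistic:.4f}, p = {reference.pvalue:.4f}")
```

Both lines should report the same F and p-value as the ANOVA table in the example output later in this chapter.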

19.3 ANOVA results table

The table presenting the results of an analysis of variance usually follows this format:

Variation source	Sum of squares (SS)	Degrees of freedom (df)	Mean square (MS)	F
Between groups	SSTR	r-1	MSTR	F = MSTR/MSE
Within groups	SSE	n-r	MSE
Total	SSTotal	n-1

19.4 Effect size

Effect size in ANOVA is most commonly measured using the \(\eta^2\) (eta-squared) coefficient. This coefficient can be interpreted as the "proportion of explained variance," i.e., the proportion of the total variance in the studied quantitative variable explained by between-group variability.

\[ \eta^2 = \frac{SSTR}{SSTotal} \tag{19.7} \]

An alternative option is the \(\omega^2\) (omega-squared) coefficient, which is considered a better (less biased) estimator of the proportion of explained variance in the population:

\[ \omega^2 = \frac{SSTR-(r-1)\cdot MSE}{SSTotal+MSE} \tag{19.8} \]

Sometimes, Cohen's \(f\) measure is used as an effect size measure:

\[ f = \sqrt{\frac{\eta^2}{1-\eta^2}} \tag{19.9} \]

The threshold for determining whether the observed effect is strong depends on the field and the specific problem. In the absence of other references, the following guidelines for \(\eta^2\) and \(\omega^2\) can be used: 0.01 – small effect, 0.06 – moderate effect, 0.14 – large effect. For \(f\), the thresholds are: 0.10 – small effect, 0.25 – moderate effect, 0.40 – large effect.
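The three effect size measures can be computed directly from the sums of squares. In the sketch below, the helper `effect_sizes` is a hypothetical function written for this illustration, and the input values are the sums of squares from the ANOVA table in the example output later in this chapter.

```python
from math import sqrt

def effect_sizes(sstr, sse, r, n):
    """Return eta-squared, omega-squared and Cohen's f.

    Hypothetical helper for illustration; implements formulas (19.7)-(19.9).
    """
    ss_total = sstr + sse
    mse = sse / (n - r)
    eta_sq = sstr / ss_total
    omega_sq = (sstr - (r - 1) * mse) / (ss_total + mse)
    cohen_f = sqrt(eta_sq / (1 - eta_sq))
    return eta_sq, omega_sq, cohen_f

# Sums of squares from the example output later in this chapter
# (4 groups, 28 observations)
eta_sq, omega_sq, cohen_f = effect_sizes(1619.535714, 4043.428571, r=4, n=28)

print(f"eta^2 = {eta_sq:.3f}, omega^2 = {omega_sq:.3f}, f = {cohen_f:.3f}")
```

In this example \(\omega^2\) comes out smaller than \(\eta^2\), which is typical: \(\eta^2\) tends to overestimate the proportion of explained variance in the population, especially in small samples.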

19.5 Post hoc procedure

If, in the analysis of variance, we reject the null hypothesis of equality of means in all populations, we may want to check which pairs of means differ significantly from each other. This is done through so-called post hoc tests. A popular test of this type is the Tukey HSD (Honestly Significant Difference) test. In a post hoc test, the means are compared pairwise, and those pairs for which the difference between the means in the sample is statistically significant are identified.

19.6 Levene's test and Brown-Forsythe test

ANOVA (analysis of variance) is used to compare means. However, we can also use a variant of analysis of variance to compare variances.

Levene's test for homogeneity of variances is used to check whether the variability in the compared populations is the same ("homogeneous"). \(H_0\) in this test states that the variances are homogeneous, while \(H_A\) states that at least one variance differs from the others.

From a computational perspective, Levene's test involves two steps:

For each observation, the absolute deviation from the group mean (or median) is calculated, and then
A regular ANOVA F-test is performed on the transformed observations.

When the absolute deviation from the group mean is calculated for each observation, the test is called Levene's test. However, when we use deviations from the group median, the test is called Levene's test with the median or Brown-Forsythe test. The Brown-Forsythe test is usually recommended as being more robust and having better properties in practical applications.
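The two computational steps can be reproduced by hand. The sketch below, assuming NumPy and SciPy are available, takes absolute deviations from the group medians (the Brown-Forsythe variant), runs an ordinary one-way ANOVA on them, and compares the result with `scipy.stats.levene`. The data are the example data used later in this chapter; the variable names are chosen for this illustration.

```python
import numpy as np
from scipy import stats

# Example data from the code later in this chapter (groups A, B, C, D)
groups = [
    np.array([51, 87, 50, 48, 79, 61, 53]),
    np.array([82, 91, 92, 80, 52, 79, 73]),
    np.array([79, 84, 74, 98, 63, 83, 85]),
    np.array([85, 80, 65, 71, 67, 51, 80]),
]

# Step 1: absolute deviation of each observation from its group median
abs_dev = [np.abs(g - np.median(g)) for g in groups]

# Step 2: regular one-way ANOVA F test on the transformed observations
manual = stats.f_oneway(*abs_dev)

# The same test with SciPy's built-in function (Brown-Forsythe variant)
builtin = stats.levene(*groups, center="median")

print(f"manual:  F = {manual.statistic:.4f}, p = {manual.pvalue:.4f}")
print(f"builtin: F = {builtin.statistic:.4f}, p = {builtin.pvalue:.4f}")
```

Replacing `np.median` with `np.mean` and `center="median"` with `center="mean"` gives the original Levene's test.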

19.7 Two-way analysis of variance

19.8 Links

One-way ANOVA – web application: https://istats.shinyapps.io/ANOVA/

19.9 Templates

ANOVA — Google spreadsheet

ANOVA — Excel template

F distribution calculator — Google spreadsheet

F distribution calculator — Excel template

# Example data
group <- as.factor(c(rep("A",7), rep("B",7), rep("C",7), rep("D",7)))
result <- c(51, 87, 50, 48, 79, 61, 53, 82, 91, 92, 80, 52, 79, 73, 
            79, 84, 74, 98, 63, 83, 85, 85, 80, 65, 71, 67, 51, 80)
data<-data.frame(group, result)

# To view the data, you can run: View(data)
# To write the data to a text file, you can run: write.csv2(data, "data.csv")
# To load data from a text file, you can run: read.csv2("data.csv")

# Summary table
library(dplyr)
data %>% 
  group_by(group) %>% 
  summarize(n=n(), suma = sum(result), srednia = mean(result), odch_st=sd(result), mediana = median(result)) %>%
  data.frame() -> summary_table

# ANOVA table
model<-aov(result~group, data=data)
anova_summary<-summary(model)

# Levene's test (Brown-Forsythe test)
levene_result<-car::leveneTest(result~group, data=data)

# Tukey's procedure
tukey_result<-TukeyHSD(x=model, conf.level=0.95)

print(list(`Summary` = summary_table, `ANOVA table` = anova_summary, `Levene test` = levene_result,
           `Tukey HSD` = tukey_result))

## $Summary
##   group n suma  srednia  odch_st mediana
## 1     A 7  429 61.28571 15.56400      53
## 2     B 7  549 78.42857 13.45185      80
## 3     C 7  566 80.85714 10.76148      83
## 4     D 7  499 71.28571 11.61485      71
## 
## $`ANOVA table`
##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        3   1620   539.8   3.204 0.0412 *
## Residuals   24   4043   168.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $`Levene test`
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  0.1898 0.9023
##       24               
## 
## $`Tukey HSD`
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = result ~ group, data = data)
## 
## $group
##          diff        lwr       upr     p adj
## B-A 17.142857  -1.996412 36.282127 0.0905494
## C-A 19.571429   0.432159 38.710698 0.0437429
## D-A 10.000000  -9.139270 29.139270 0.4870470
## C-B  2.428571 -16.710698 21.567841 0.9849136
## D-B -7.142857 -26.282127 11.996412 0.7340659
## D-C -9.571429 -28.710698  9.567841 0.5237024

import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import anova_lm

# Data frame
group = ['A']*7 + ['B']*7 + ['C']*7 + ['D']*7
result = [51, 87, 50, 48, 79, 61, 53,
          82, 91, 92, 80, 52, 79, 73, 
          79, 84, 74, 98, 63, 83, 85, 
          85, 80, 65, 71, 67, 51, 80]
data = pd.DataFrame({'group': group, 'result': result})

# Summary table
summary_table = data.groupby('group')['result'].agg(['count', 'sum', 'mean', 'std', 'median']).reset_index()
summary_table.columns = ['group', 'n', 'sum', 'mean', 'sd', 'median']

# ANOVA
model = ols('result ~ group', data=data).fit()
anova_summary = anova_lm(model, typ=2)

# Levene's test (Brown-Forsythe test)
levene_result = stats.levene(data['result'][data['group'] == 'A'],
                             data['result'][data['group'] == 'B'],
                             data['result'][data['group'] == 'C'],
                             data['result'][data['group'] == 'D'])

# Tukey's HSD
tukey_result = pairwise_tukeyhsd(endog=data['result'], groups=data['group'], alpha=0.05)

print('Summary:\n\n', summary_table, '\n\nANOVA:\n\n', anova_summary, '\n\nLevene\'s test:\n\n', levene_result,
       '\n\nTukey HSD:\n\n', tukey_result)

## Summary:
## 
##    group  n  sum       mean         sd  median
## 0     A  7  429  61.285714  15.564000    53.0
## 1     B  7  549  78.428571  13.451854    80.0
## 2     C  7  566  80.857143  10.761483    83.0
## 3     D  7  499  71.285714  11.614851    71.0 
## 
## ANOVA:
## 
##                 sum_sq    df         F    PR(>F)
## group     1619.535714   3.0  3.204282  0.041204
## Residual  4043.428571  24.0       NaN       NaN 
## 
## Levene's test:
## 
##  LeveneResult(statistic=0.18975139523084725, pvalue=0.9023335775328473) 
## 
## Tukey HSD:
## 
##   Multiple Comparison of Means - Tukey HSD, FWER=0.05 
## =====================================================
## group1 group2 meandiff p-adj   lower    upper  reject
## -----------------------------------------------------
##      A      B  17.1429 0.0905  -1.9964 36.2821  False
##      A      C  19.5714 0.0437   0.4322 38.7107   True
##      A      D     10.0  0.487  -9.1393 29.1393  False
##      B      C   2.4286 0.9849 -16.7107 21.5678  False
##      B      D  -7.1429 0.7341 -26.2821 11.9964  False
##      C      D  -9.5714 0.5237 -28.7107  9.5678  False
## -----------------------------------------------------

19.10 Exercises

Exercise 19.1 (McClave and Sincich 2012) A psychologist assessed three methods of remembering names under controlled conditions. A sample of 139 students was randomly divided into three groups; the members of each group learned one another's names in a different way. Group 1 used a "simple game," where the first person says their name, the second person says their name and the first person's name, the third person says their name and the names of the first two, etc. Group 2 used a "complex game," a modification of the simple game in which the students say not only their names but also their favorite activities (e.g., sports). Group 3 used "pair introductions": each student must introduce the other person in the pair. A year later, all participants were sent group photos and asked to provide the names of the people in the photo. The researchers measured the percentage of names remembered by each respondent. Perform an analysis of variance to determine whether the average percentage of remembered names differs between the three memory methods. Use \(\alpha = 0.05\).

Simple game: 24, 43, 38, 65, 35, 15, 44, 44, 18, 27, 0, 38, 50, 31, 7, 46, 33, 31, 0, 29, 0, 0, 52, 0, 29, 42, 39, 26, 51, 0, 42, 20, 37, 51, 0, 30, 43, 30, 99, 39, 35, 19, 24, 34, 3, 60, 0, 29, 40, 40

Complex game: 39, 71, 9, 86, 26, 45, 0, 38, 5, 53, 29, 0, 62, 0, 1, 35, 10, 6, 33, 48, 9, 26, 83, 33, 12, 5, 0, 0, 25, 36, 39, 1, 37, 2, 13, 26, 7, 35, 3, 8, 55, 50

Pairs: 5, 21, 22, 3, 32, 29, 32, 0, 4, 41, 0, 27, 5, 9, 66, 54, 1, 15, 0, 26, 1, 30, 2, 13, 0, 2, 17, 14, 5, 29, 0, 45, 35, 7, 11, 4, 9, 23, 4, 0, 8, 2, 18, 0, 5, 21, 14

Exercise 19.2 (Agresti, Franklin, and Klingenberg 2016) An airline manages a customer service hotline through which clients make reservations. Sometimes, when the number of agents is too low relative to customer demand, some callers are placed on hold. The airline conducted an experiment to determine if what the customer hears in the background while waiting influences their willingness to wait patiently. In the experiment, the airline randomly placed one in every thousand calls on hold. In this mode, some callers heard advertisements for current promotions, others heard functional music (called "elevator music" or "muzak"), and still others heard classical music. For each caller involuntarily participating in the experiment, the time until they terminated the call was recorded:

Advertisements: 5; 1; 11; 2; 8

Muzak: 0; 1; 4; 6; 3

Classical music: 13; 9; 8; 15; 7

Using an ANOVA test, check if there is a relationship between what callers hear in the background and their time spent patiently waiting for the call to be answered.

Exercise 19.3 Dr B conducted an experiment where he commuted to work (lectures and laboratories on mathematical statistics at the Gdańsk University of Technology) on one of three randomly selected bicycles, measuring the travel time.

Data

In a two-way analysis of variance (with or without interaction), check if the individual factors (direction: from PG / to PG and bicycle) might have affected the travel time. Is there an interaction between the bicycle and the direction of travel? What assumptions should have been made?

Literature

Agresti, Alan, Christine Franklin, and Bernhard Klingenberg. 2016. Statistics: The Art and Science of Learning from Data. 4th edition. Pearson.

McClave, J. T., and T. T. Sincich. 2012. Statistics. Pearson Education. https://books.google.pl/books?id=gcYsAAAAQBAJ.

¹¹ The term "treatment" used to refer to groups in ANOVA comes from medical and experimental design terminology, where different groups are subjected to different therapies or treatments.