18 Chi-square tests

18.1 Applications of the chi-square tests

The chi-square test follows a similar structure each time, but serves many purposes. We will use the following applications of the chi-square test:

  • goodness-of-fit tests: we check whether the distribution in a certain population is consistent with a theoretical distribution,

  • homogeneity tests: we check whether the structure of qualitative (categorical) variables is the same, i.e., homogeneous, in two or more populations,

  • independence tests: we check whether two qualitative variables are independent of each other in the population.

18.2 Formula

Although the test has various applications, we will see that the formula used to calculate the test statistic takes the same general form in all cases. We can express it as follows:

\[ \chi^2 = \sum_{i} \frac{(O_i-E_i)^2}{E_i}, \tag{18.1} \] where:

\(i\) – the index running over the individual categories or "cells" of a contingency table,

\(O_i\) – the observed (actual) frequencies,

\(E_i\) – the expected frequencies, determined from the assumed distribution or from the assumption of independence, also known as theoretical frequencies.
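
As a minimal illustration of formula (18.1), the statistic can be computed directly in R from hypothetical observed and expected frequencies:

# Hypothetical frequencies for a variable with three categories
O <- c(50, 30, 20)   # observed
E <- c(40, 40, 20)   # expected

# Chi-square statistic: sum of (O - E)^2 / E over all categories
chi_sq <- sum((O - E)^2 / E)
chi_sq
## [1] 5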

18.3 Hypotheses

In the goodness-of-fit test, \(H_0\) states that the distribution of a qualitative variable in the population matches the assumed distribution, while \(H_A\) states that it is different.

In the homogeneity test, \(H_0\) states that the distribution of the qualitative variable in two or more populations is the same ("homogeneous"), while \(H_A\) states that it is different.

In the independence test, \(H_0\) states that two qualitative variables are independent in the population, while \(H_A\) states that they are dependent.

18.4 Expected frequencies

In the goodness-of-fit test, the expected frequencies follow from the assumed distribution: each category's assumed probability multiplied by the sample size gives that category's expected frequency.

In the homogeneity and independence tests, the expected frequencies must be calculated. In both tests, the frequency tables take the form of a contingency table. The expected frequency must be calculated for each cell in this table.

\[ \text{Expected frequency} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Total}}\]
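
A minimal sketch of this calculation in R, using a small hypothetical contingency table; the outer product of the row and column totals divided by the grand total gives all cells at once, and chisq.test() returns the same table in its expected component:

# Hypothetical 2x3 contingency table
m <- matrix(c(20, 30, 10,
              15, 25, 20), nrow = 2, byrow = TRUE)

# Row totals, column totals and the grand total
row_totals <- rowSums(m)
col_totals <- colSums(m)
n <- sum(m)

# Expected frequency for each cell: row total * column total / grand total
expected <- outer(row_totals, col_totals) / n
expected

# The same table is returned by chisq.test()
chisq.test(m)$expected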

18.5 Conditions for test application

The chi-square test is an approximate test (in this regard it is similar to the z tests for means or proportions) and, as such, has specific requirements. In addition to the standard assumptions of randomness and representativeness of the sample, it also requires an adequate sample size. A common guideline is that the expected frequency in each cell should be at least 5 for the approximation to be valid.
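
A quick way to check this condition is to inspect the expected frequencies returned by chisq.test(); the sketch below uses a small hypothetical table that violates the guideline:

# Hypothetical 2x2 table with small counts
m <- matrix(c(8, 2,
              3, 1), nrow = 2, byrow = TRUE)

# chisq.test() itself warns when the approximation may be poor;
# here the warning is suppressed and the expected counts inspected directly
expected <- suppressWarnings(chisq.test(m, correct = FALSE)$expected)
expected

# Does every cell meet the "at least 5" guideline?
all(expected >= 5)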

18.6 Degrees of freedom

The test statistic in the chi-square test follows a chi-square distribution with a specified number of degrees of freedom (this is the only parameter of the chi-square distribution). The rules below give the degrees of freedom for each application; a short R sketch follows the list.

  • In goodness-of-fit tests against a fully specified probability distribution, the degrees of freedom are \(k-1\), where \(k\) is the number of classes (categories).

  • In goodness-of-fit tests against a distribution whose parameters are estimated from the same data, the degrees of freedom are \(k-m-1\), where \(k\) is the number of categories and \(m\) is the number of estimated parameters.

  • In homogeneity tests, the degrees of freedom are \((k-1)\times(c-1)\), where \(k\) is the number of categories and \(c\) is the number of populations; when only two populations are compared, this reduces to \(k-1\).

  • In independence tests, the degrees of freedom are \((r-1)\times(c-1)\), where \(r\) is the number of rows in the contingency table (the number of categories of one variable), and \(c\) is the number of columns (the number of categories of the second variable).
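
A short sketch of these formulas in R, with hypothetical values of \(k\), \(m\), and the table dimensions:

# Goodness of fit: k categories, m parameters estimated from the data
k <- 6
m_params <- 2
df_gof     <- k - 1             # fully specified distribution
df_gof_est <- k - m_params - 1  # parameters estimated from the same data

# Independence / homogeneity: (rows - 1) * (columns - 1) of the contingency table
tab <- matrix(c(21, 14,
                3, 10), nrow = 2, byrow = TRUE)
df_table <- (nrow(tab) - 1) * (ncol(tab) - 1)

c(df_gof, df_gof_est, df_table)
## [1] 5 3 1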

18.7 Rejection region

In all chi-square tests mentioned in this chapter, the rejection region is right-tailed, i.e., we reject the null hypothesis if the test statistic exceeds the critical value.

Note! In the case of the chi-square test, we do not refer to the alternative hypothesis as "left-tailed", "right-tailed", or "two-tailed" – the alternative hypothesis is rather "multi-tailed".
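
In R, the critical value and the right-tail p-value can be read off the chi-square distribution functions; a sketch with hypothetical values of the statistic and the degrees of freedom:

alpha <- 0.05
df <- 1        # hypothetical degrees of freedom
chi_sq <- 5.17 # hypothetical value of the test statistic

# Critical value: reject H0 if the statistic exceeds it
qchisq(1 - alpha, df)

# Right-tail p-value
pchisq(chi_sq, df, lower.tail = FALSE)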

18.8 Chi-square tests versus proportion tests

Chi-square goodness-of-fit and homogeneity tests can be viewed as extensions of the proportion tests: they allow more than two categories or (in the case of homogeneity tests) more than two populations.

In goodness-of-fit tests, the extension lies in the fact that the test for one proportion is limited to a binary distribution (a distribution with only two categories), while the goodness-of-fit test can be applied to more than two categories. The two-tailed test for one proportion is practically equivalent to the chi-square goodness-of-fit test for a dichotomous variable.9
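
This equivalence can be verified directly: the squared z statistic of the (uncorrected) one-proportion test equals the chi-square goodness-of-fit statistic for the corresponding two-category table. A sketch with hypothetical counts:

# Hypothetical data: 60 successes in 100 trials, H0: p = 0.5
x <- 60; n <- 100; p0 <- 0.5

# z statistic of the one-proportion test (no continuity correction)
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)

# Chi-square goodness-of-fit statistic for the dichotomous variable
chi <- chisq.test(c(x, n - x), p = c(p0, 1 - p0))$statistic

c(z^2, unname(chi))  # both equal 4 in this example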

In homogeneity tests, the extension may involve examining more than two populations (a test for three or more proportions), increasing the number of categories, or both. The two-tailed test for two proportions is practically equivalent to the chi-square homogeneity test for two populations and a binary variable.10
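
Likewise, prop.test() without the Yates correction reports the same chi-square statistic as the homogeneity test on the corresponding 2×2 table; a sketch with hypothetical counts:

# Hypothetical data: 30 of 50 successes in group 1, 20 of 50 in group 2
successes <- c(30, 20)
trials    <- c(50, 50)

# Two-proportion test without the Yates correction (its statistic is chi-square = z^2)
prop.test(successes, trials, correct = FALSE)$statistic

# The same value from the chi-square test on the 2x2 table
tab <- cbind(successes, trials - successes)
chisq.test(tab, correct = FALSE)$statistic  # both statistics equal 4 here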

18.9 Effect size in independence tests

For contingency tables showing frequencies in an independence test, several measures of effect size have been proposed. The most commonly used measure is Cramér's V:

\[ V = \sqrt{\frac{\chi^2}{n \times \text{min}(c-1, r-1)}} \tag{18.2} \]

For 2×2 tables, the phi coefficient is also used; its absolute value equals Cramér's V, but unlike V it can be negative, with the sign indicating the direction of the association.
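
For a 2×2 table with first row \(a, b\) and second row \(c, d\), the phi coefficient can be written as \(\varphi = (ad-bc)/\sqrt{(a+b)(c+d)(a+c)(b+d)}\). A sketch of the calculation in R on hypothetical counts:

# Hypothetical 2x2 table
m <- matrix(c(21, 14,
              3, 10), nrow = 2, byrow = TRUE)
a <- m[1, 1]; b <- m[1, 2]; c_ <- m[2, 1]; d <- m[2, 2]

# Phi with a sign: positive when the main diagonal (a, d) dominates
phi <- (a * d - b * c_) / sqrt((a + b) * (c_ + d) * (a + c_) * (b + d))
phi  # approx. 0.328 here

# Cramer's V for the same table equals |phi|
chi <- chisq.test(m, correct = FALSE)$statistic
unname(sqrt(chi / sum(m) / (min(dim(m)) - 1)))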

18.11 Templates

Spreadsheets

Chi-square test — Google spreadsheet

Chi-square test — Excel template

Chi-square distribution calculator — Google spreadsheet

R code

# Chi-square test of independence / homogeneity

# Matrix with the data (as an input vector)
m <- c(
  21, 14,
  3, 10
)

# Number of rows of the matrix
nrow <- 2

# Significance level
alpha <- 0.05

# Conversion of the vector into a matrix
m <- matrix(data=m, nrow=nrow, byrow=TRUE)

# Chi-square test without the Yates correction
test_chi <- chisq.test(m, correct=FALSE)

# Chi-square test with the Yates correction (applied for 2x2 tables)
test_chi_corrected <- chisq.test(m)

# G-test
test_g <- AMR::g.test(m)

# Fisher's exact test
exact_fisher <- fisher.test(m)

print(c('Liczba stopni swobody' = test_chi$parameter, 
        'Wartość krytyczna' = qchisq(1-alpha, test_chi$parameter), 
        'Statystyka chi^2' = unname(test_chi$statistic),
        'Wartość p (test chi-kwadrat)' = test_chi$p.value,
        'V Cramera' = unname(sqrt(test_chi$statistic/sum(m)/min(dim(m)-1))),
        'Współczynnik fi (dla tabel 2x2)' = if(all(dim(m)==2)) {psych::phi(m, digits=10)},
        'Statystyka chi^2 z poprawką Yatesa' = unname(test_chi_corrected$statistic),
        'Wartość p (test chi^2 z poprawką Yatesa)' = test_chi_corrected$p.value,
        'Statystyka G' = unname(test_g$statistic),
        'Wartość p (test G)' = test_g$p.value,
        'Wartość p (test dokładny Fishera)' = exact_fisher$p.value
))
##                 Liczba stopni swobody.df                        Wartość krytyczna                         Statystyka chi^2 
##                               1.00000000                               3.84145882                               5.16923077 
##             Wartość p (test chi-kwadrat)                                V Cramera          Współczynnik fi (dla tabel 2x2) 
##                               0.02299039                               0.32816506                               0.32816506 
##       Statystyka chi^2 z poprawką Yatesa Wartość p (test chi^2 z poprawką Yatesa)                             Statystyka G 
##                               3.79780220                               0.05131990                               5.38600494 
##                       Wartość p (test G)        Wartość p (test dokładny Fishera) 
##                               0.02029889                               0.04899141
# Chi-square goodness-of-fit test

# Observed frequencies:
observed <- c(70, 10, 20)

# Expected frequencies:
expected <- c(80, 10, 10)

# Optional rescaling of the expected frequencies so that their sum is exactly equal to the sum of the observed frequencies:
expected <- expected / sum(expected) * sum(observed)

# Significance level
alpha <- 0.05

test_chi <- chisq.test(x = observed, p = expected, rescale.p = TRUE)
test_g <- AMR::g.test(x = observed, p = expected, rescale.p = TRUE)


print(c('Liczba stopni swobody' = test_chi$parameter, 
        'Wartość krytyczna' = qchisq(1-alpha, test_chi$parameter), 
        'Statystyka chi^2' = unname(test_chi$statistic),
        'Wartość p (test chi-kwadrat)' = test_chi$p.value,
        'Statystyka G' = unname(test_g$statistic),
        'Wartość p (test G)' = test_g$p.value
))
##     Liczba stopni swobody.df            Wartość krytyczna             Statystyka chi^2 Wartość p (test chi-kwadrat) 
##                  2.000000000                  5.991464547                 11.250000000                  0.003606563 
##                 Statystyka G           Wartość p (test G) 
##                  9.031492255                  0.010935443

Python code

# Chi-square test of independence / homogeneity
import numpy as np
import scipy.stats as stats
from scipy.stats import chi2

# Data (matrix):
m = np.array([
  [21, 14],
  [3, 10]
])

alpha = 0.05

# Chi-square test without the Yates correction:
test_chi = stats.chi2_contingency(m, correction=False)

# Chi-square test with the Yates correction (the default for 2x2 tables):
test_chi_corrected = stats.chi2_contingency(m)

# G-test; note that for 2x2 tables chi2_contingency applies the Yates correction
# by default, so this G statistic differs from the uncorrected AMR::g.test result
# in the R template (pass correction=False to obtain the uncorrected G-test):
g, p, dof, expected = stats.chi2_contingency(m, lambda_="log-likelihood")

# Fisher's exact test:
exact_fisher = stats.fisher_exact(m)

# Cramer's V:
cramers_v = np.sqrt(test_chi[0] / m.sum() / min(m.shape[0]-1, m.shape[1]-1))

# Phi coefficient for a 2x2 table (same magnitude as Cramer's V, with a sign):
phi_coefficient = None
if m.shape == (2, 2):
    phi_coefficient = cramers_v*np.sign(np.diagonal(m).prod()-np.diagonal(np.fliplr(m)).prod())

# Results
results = {
    'Liczba stopni swobody': test_chi[2],
    'Wartość krytyczna': chi2.ppf(1-alpha, test_chi[2]),
    'Statystyka chi-kwadrat': test_chi[0],
    'p-value (test chi-kwadrat)': test_chi[1],
    "V Cramera": cramers_v,
    'Współczynnik fi (dla tabeli 2x2)': phi_coefficient,
    'Statystyka chi-kwadrat z poprawką Yatesa': test_chi_corrected[0],
    'p-value (test chi-kwadrat z poprawką Yatesa)': test_chi_corrected[1],
    'Statystyka G': g,
    'p-value (test G)': p,
    'p-value (dokładny test Fishera)': exact_fisher[1]
}

for key, value in results.items():
    print(f"{key}: {value}")
## Liczba stopni swobody: 1
## Wartość krytyczna: 3.841458820694124
## Statystyka chi-kwadrat: 5.169230769230769
## p-value (test chi-kwadrat): 0.022990394092464842
## V Cramera: 0.3281650616569468
## Współczynnik fi (dla tabeli 2x2): 0.3281650616569468
## Statystyka chi-kwadrat z poprawką Yatesa: 3.7978021978021976
## p-value (test chi-kwadrat z poprawką Yatesa): 0.05131990358807137
## Statystyka G: 3.9106978537750194
## p-value (test G): 0.04797967015430134
## p-value (dokładny test Fishera): 0.048991413058947844
# Chi-square goodness-of-fit test

from scipy.stats import chisquare, chi2
import numpy as np

# Observed frequencies:
observed = np.array([70, 10, 20])
# Expected frequencies:
expected = np.array([80, 10, 10])

# Optional rescaling of the expected frequencies so that their sum is exactly equal to the sum of the observed frequencies:
expected = expected / expected.sum() * observed.sum()

# Chi-square test:
chi_stat, chi_p = chisquare(f_obs=observed, f_exp=expected)

# Degrees of freedom:
df = len(observed) - 1

# Significance level:
alpha = 0.05

# Critical value:
critical_value = chi2.ppf(1 - alpha, df)

# G-test:
from scipy.stats import power_divergence
g_stat, g_p = power_divergence(f_obs=observed, f_exp=expected, lambda_="log-likelihood")

# Results

results = {
    'Liczba stopni swobody': df,
    'Wartość krytyczna': critical_value,
    'Statystyka chi^2': chi_stat,
    'Wartość p (test chi-kwadrat)': chi_p,
    'Statystyka G': g_stat,
    'Wartość p (test G)': g_p
}

for key, value in results.items():
    print(f"{key}: {value}")
## Liczba stopni swobody: 2
## Wartość krytyczna: 5.991464547107979
## Statystyka chi^2: 11.25
## Wartość p (test chi-kwadrat): 0.0036065631360157305
## Statystyka G: 9.031492254964643
## Wartość p (test G): 0.010935442847719828

18.12 Exercises

Exercise 18.1 A sample of bank customers was drawn, and they were asked about their income and the banking product they were most interested in. The following results were obtained:

Income    Loans   Deposits   Investments   Total
Low          34         18            10      62
Medium       19         30            21      70
High         20         18            31      69
Total        73         66            62     201

Is the type of interest in the bank's product offering independent of income level?

Exercise 18.2 (Aczel and Sounderpandian 2018) A certain study describes the analysis of 35 key product categories. At the time of this study, 72.9% of the products sold were national brand products, 23% were private-label products, and 4.1% were no-name products. Suppose we want to test whether this distribution is still valid in today’s market. We collect a random sample of 1000 products from the 35 analyzed categories and find that 610 items are national brand products, 290 are private-label products, and 100 are no-name products. Conduct a test and formulate conclusions.

Exercise 18.3 A total of 200 rolls of a six-sided die were performed. The results were: 38 ones, 35 twos, 35 threes, 27 fours, 31 fives, and 34 sixes. Using the chi-square test, verify the hypothesis that the die is "fair," i.e., properly balanced. (If you have a die at hand, you can test it instead.)

Exercise 18.4 Benford's Law describes the probability distribution of the first digit appearing in many empirical datasets.

According to Benford's Law, the digit one appears as the first digit in \(\log_{10}(1+1/1) \approx 30.1\%\) of cases, two appears in \(\log_{10}(1+1/2) \approx 17.6\%\) of cases, three in \(\log_{10}(1+1/3) \approx 12.5\%\), and so on.
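
The full vector of expected Benford proportions for the digits 1–9 (to be used as the expected probabilities in the goodness-of-fit test) can be generated directly, e.g. in R:

# Expected first-digit probabilities under Benford's law
digits <- 1:9
benford_p <- log10(1 + 1 / digits)
round(benford_p, 3)
## [1] 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046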

The goal is to verify whether the distribution of the first digits in the annual salary data (in British pounds) of football players in different leagues (Bundesliga, La Liga, Ligue 1, Premier League, and Serie A, separately for each league) conforms to Benford's distribution.

Data: https://drive.google.com/file/d/1OPDlBv0yR-aUwYGYcnDGAffFhK2iUvD5/view?usp=share_link

Data collected by Edd Webster

Exercise 18.5 (Aczel and Sounderpandian 2018) It is believed that the returns from a certain investment follow a normal distribution with a mean of 11% (annualized rate) and a standard deviation of 2%. A brokerage firm wants to test the null hypothesis that this statement is true and has collected the following return data (assumed to be a random sample): 8.0; 9.0; 9.5; 9.5; 8.6; 13.0; 14.5; 12.0; 12.4; 19.0; 9.0; 10.0; 10.0; 11.7; 15.0; 10.1; 12.7; 17.0; 8.0; 9.9; 11.0; 12.5; 12.8; 10.6; 8.8; 9.4; 10.0; 12.3; 12.9; 7.0. Conduct an analysis using the chi-square test with six intervals and formulate a conclusion.

Exercise 18.6 (Aczel and Sounderpandian 2018) Using the data from the previous exercise, test the null hypothesis that investment returns follow a normal distribution but with an unknown mean and an unknown standard deviation. Only test the validity of the normality assumption. How does this test differ from the one in the previous exercise?

Exercise 18.7 (Aczel and Sounderpandian 2018) As markets become increasingly international, many companies invest in research to determine the maximum potential sales range in foreign markets. An American coffee machine manufacturer wants to verify whether its market share and the shares of its two competitors are roughly the same in three European countries where all three companies export their products. The results of a market study are presented in the table below. The data come from random samples of 150 consumers in each country.

                France   England   Spain   Total
Company             55        38      24     117
Competitor 1        28        30      21      79
Competitor 2        20        18      31      69
Others              47        64      74     185
Total              150       150     150     450

Exercise 18.8 Using the chi-square test and data from exercise 15.1, test the null hypothesis that the distribution of head tilts during kissing has parameters \(p_R = 0.5\) and \(p_L = 0.5\) (\(p_R\) is the probability of tilting the head to the right, \(p_L\) is the probability of tilting the head to the left). What one-proportion test corresponds to the conducted chi-square test? Compare the resulting p-values.

Exercise 18.9 In 1972, 48 bank managers were given identical personnel files. Each of them was asked to assess whether the individual should be promoted to branch manager or whether other candidates should be interviewed. The files were identical, except that in half of the cases, the person was identified as a woman, and in the other half, as a man. Among the 24 "male" cases, promotion was recommended in 21 instances; among the 24 "female" cases, promotion was recommended in 14 (Rosen and Jerdee 1974). Solve the problem using the two-sided difference in proportions test and compare the results (test statistics, p-values) with the chi-square test.

Exercise 18.10 Return to the data from exercise 15.5. Test the hypothesis that the distribution of lost keys in the population (i.e., in the data-generating process) is uniform (a discrete uniform distribution).

Exercise 18.11 The table below presents data on bomb hits in the southern part of London during World War II (Clarke 1946). The entire area was divided into 576 square sections, each covering 1/4 km². The symbol \(n_k\) denotes the number of squares that were hit by exactly \(k\) bombs.

k       0     1    2    3   4   5   6   7
nₖ    229   211   93   35   7   0   0   1

Based on the given data, can the hypothesis that the London bombings were random and did not concentrate on any specific area be rejected?

Literature

Aczel, A. D., and J. Sounderpandian. 2018. Statystyka w Zarządzaniu. PWN. https://ksiegarnia.pwn.pl/Statystyka-w-zarzadzaniu,731934758,p.html.
Clarke, R. D. 1946. “An Application of the Poisson Distribution.” Journal of the Institute of Actuaries 72 (3): 481. https://doi.org/10.1017/S0020268100035435.
Rosen, Benson, and Thomas H. Jerdee. 1974. “Influence of Sex Role Stereotypes on Personnel Decisions.” Journal of Applied Psychology 59: 9–14. https://doi.org/10.1037/h0035834.

  9. If the distribution is binary (follows a Bernoulli distribution, e.g., yes/no, disease/no disease, etc.) and we compare it with some theoretical distribution, the one-proportion test (Chapter 15) can be used, as well as the goodness-of-fit test (compare exercises 15.1 and 18.8).

  10. If we compare two proportions, the test for two proportions (Chapter 17) or the chi-square homogeneity test can be used – cf. exercises 17.2 and 18.9.