22 Other tests

22.1 Checking conformity with the Gaussian distribution

Popular "normality tests" include the Kolmogorov-Smirnov test, the Shapiro-Wilk-Royston test, and the Jarque-Bera test.

22.1.1 Kolmogrov-Smirnov test

22.1.2 Shapiro-Wilk-Royston test

22.1.3 Jarque-Bera test

22.1.4 Anderson-Darling test

22.2 Test and confidence interval for Pearson's correlation coefficient

The correlation coefficient examines the relationship between two quantitative variables. It can be positive or negative, and takes values between -1 and 1. The sample correlation coefficient is commonly denoted by the symbol \(r\), while the population correlation coefficient is represented by the Greek letter \(\rho\) ("rho").

A popular test for the correlation coefficient is the t-test. This test usually assumes that the joint probability distribution of the two variables follows a bivariate normal distribution.

The null hypothesis in this test states that the variables in the population are not linearly correlated:

\[ H_0: \rho = 0, \]

while the alternative hypothesis (two-tailed) states that they are linearly correlated:

\[ H_A: \rho \ne 0 \]

It is also possible to apply one-tailed hypotheses: \(H_A: \rho > 0\) lub \(H_A: \rho < 0\).

The test statistic is based on the \(t\) -distribution with \(n-2\) degrees of freedom and has the following form:

\[ t = r {\sqrt {\frac {n-2}{1-r^{2}}}}, \] where \(n\) is the sample size, and \(r\) is the correlation coefficient in the sample.

The formula for the confidence interval of Pearson's correlation coefficient is quite complex. The lower bound is obtained as follows:

\[ \frac{\text{exp}\left(ln\frac{1+r}{1-r}-\frac{2}{\sqrt{n-3}}z_{\alpha/2}\right)-1}{\text{exp}\left(ln\frac{1+r}{1-r}-\frac{2}{\sqrt{n-3}}z_{\alpha/2}\right)+1} \]

and the upper (right) bound is:

\[ \frac{\text{exp}\left(ln\frac{1+r}{1-r}+\frac{2}{\sqrt{n-3}}z_{\alpha/2}\right)-1} {\text{exp}\left(ln\frac{1+r}{1-r}+\frac{2}{\sqrt{n-3}}z_{\alpha/2}\right)+1} \]

22.3 Templates

Normality check — Google spreadsheet

Normality check — Excel template

Correlation coefficient — Google spreadsheet

Correlation coefficient — Excel template

R code

# Normality check
# Data
data<-c(76, 62, 55, 62, 56, 64, 44, 56, 94, 37, 72, 85, 72, 71, 60, 61, 64, 71, 69, 70, 64, 59, 71, 65, 84)
# Sample mean
m<-mean(data)
# Sample standard deviation
s<-sd(data)
# Skewness
skew<-e1071::skewness(data, type=2)
# Kurtosis
kurt<-e1071::kurtosis(data, type=2)
# IQR/s ratio
IQR_to_s<-IQR(data)/s
# proportion of observations within 1 sd from the mean
within1s<-mean(abs(data-m)<s)
# proportion of observations within 2 sds from the mean
within2s<-mean(abs(data-m)<2*s)
# proportion of observations within 3 sd from the mean
within3s<-mean(abs(data-m)<3*s)
# Shapiro-Wilk test
SW_res<-shapiro.test(data)
# Jarque-Bera test
JB_res<-tseries::jarque.bera.test(data)
# Anderson-Darling test
AD_res<-nortest::ad.test(data)
# Kolmogorov-Smirnov test
KS_res <- ks.test(data, function(x){pnorm(x, mean(data), sd(data))})

# Results
print(c('Mean' = m, 
        'St. deviation' = s,
        'Sample size' = length(data),
        'Skewnes (~=0?)' = skew,
        'Kurtosis (~=0?)' = kurt,
        'IQR/s (~=1,3?)' = IQR_to_s,
        '68% rule' = within1s*100,
        '95% rule' = within2s*100,
        '100% rule' = within3s*100,
        'Test stat. - SW test' = SW_res$statistic,
        'P-value - SW test' = SW_res$p.value,
        'Test stat. - JB test' = JB_res$statistic,
        'P-value - JB test' = JB_res$p.value,
        'Test stat. - AD test' = AD_res$statistic,
        'P-value - AD test' = AD_res$p.value,
        'Test stat - KS test' = KS_res$statistic,
        'P-value - KS test' = KS_res$p.value
))

##                           Mean                  St. deviation                    Sample size                 Skewnes (~=0?) 
##                   65.760000000                   12.145918382                   25.000000000                   -0.002892879 
##                Kurtosis (~=0?)                 IQR/s (~=1,3?)                       68% rule                       95% rule 
##                    1.121315636                    0.905654036                   80.000000000                   92.000000000 
##                      100% rule         Test stat. - SW test.W              P-value - SW test Test stat. - JB test.X-squared 
##                  100.000000000                    0.964892812                    0.520206266                    0.479578632 
##              P-value - JB test         Test stat. - AD test.A              P-value - AD test          Test stat - KS test.D 
##                    0.786793609                    0.445513641                    0.260454333                    0.143712404 
##              P-value - KS test 
##                    0.680154210

# Plots
hist(data, prob=TRUE)
curve(dnorm(x, m, s), col="darkblue", lwd=2, add=TRUE)

qqnorm(data, pch=16)
qqline(data, col='darkblue')

#data:
x <- c(4, 8, 6, 5, 3, 8, 6, 5, 5, 6, 3, 3, 3, 7, 3, 6, 4, 4, 8, 7)
y <- c(4, 6, 8, 3, 4, 6, 5, 8, 1, 4, 4, 3, 1, 6, 7, 5, 4, 4, 9, 5)
#test and confidence interval:
cor.test(x, y, alternative = "two.sided", conf.level = 0.95)

## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 2.56, df = 18, p-value = 0.01968
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.09607185 0.78067292
## sample estimates:
##       cor 
## 0.5166288

Python code

import numpy as np
import math as math
from scipy import stats
from scipy.stats import skew, kurtosis, anderson, kstest, norm

data = [76, 62, 55, 62, 56, 64, 44, 56, 94, 37, 72, 85, 72, 71, 60, 61, 64, 71, 69, 70, 64, 59, 71, 65, 84]

m = np.mean(data)

s = np.std(data, ddof=1)

skew = skew(data, bias=False)

kurt = kurtosis(data, bias=False)

IQR_to_s = stats.iqr(data) / s

within1s = np.mean(np.abs(data - m) < s)

within2s = np.mean(np.abs(data - m) < 2 * s)

within3s = np.mean(np.abs(data - m) < 3 * s)

SW_res = stats.shapiro(data)

JB_res = stats.jarque_bera(data)

AD_res = stats.anderson(data)

AD2 = AD_res[0]*(1 + (.75/50) + 2.25/(50**2))

if AD2 >= .6:
    AD_p = math.exp(1.2937 - 5.709*AD2 - .0186*(AD2**2))
elif AD2 >=.34:
    AD_p = math.exp(.9177 - 4.279*AD2 - 1.38*(AD2**2))
elif AD2 >.2:
    AD_p = 1 - math.exp(-8.318 + 42.796*AD2 - 59.938*(AD2**2))
else:
    AD_p = 1 - math.exp(-13.436 + 101.14*AD2 - 223.73*(AD2**2))
    
KS_res = kstest(data, norm(loc=np.mean(data), scale=np.std(data)).cdf)

results = {
    'Mean': m, 
    'St. deviation': s,
    'Sample size': len(data),
    'Skewness (~=0?)': skew,
    'Excess kurtosis (~=0?)': kurt,
    'IQR/s (~=1.3?)': IQR_to_s,
    '68% rule': within1s * 100,
    '95% rule': within2s * 100,
    '100% rule': within3s * 100,
    'Test stat. - SW test': SW_res[0],
    'P-value - SW test': SW_res[1],
    'Test stat. - JB test': JB_res[0],
    'P-value - JB test': JB_res[1],
    'Test stat. - AD test': AD_res[0],
    'P-value - AD test': AD_p,
    'Test stat. - KS test': KS_res[0],
    'P-value - KS test': KS_res[1]
}


for key, value in results.items():
    print(f"{key}: {value}")

## Mean: 65.76
## St. deviation: 12.145918381634768
## Sample size: 25
## Skewness (~=0?): -0.002892878582795321
## Excess kurtosis (~=0?): 1.121315636006976
## IQR/s (~=1.3?): 0.9056540357320815
## 68% rule: 80.0
## 95% rule: 92.0
## 100% rule: 100.0
## Test stat. - SW test: 0.9648927450180054
## P-value - SW test: 0.5202044248580933
## Test stat. - JB test: 0.47957863171421605
## P-value - JB test: 0.7867936085428064
## Test stat. - AD test: 0.44551364110243696
## P-value - AD test: 0.27208274566708046
## Test stat. - KS test: 0.14001867878746022
## P-value - KS test: 0.6601736174333425

import matplotlib.pyplot as plt

# Histogram
plt.hist(data, density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, m, s)
plt.plot(x, p, 'k', linewidth=2, color="darkblue")
plt.show()

# Q-Q plot
qq_data = stats.probplot(data, dist="norm", plot=plt)
plt.show()

import scipy.stats as stats
import numpy as np

x = [4, 8, 6, 5, 3, 8, 6, 5, 5, 6, 3, 3, 3, 7, 3, 6, 4, 4, 8, 7]
y = [4, 6, 8, 3, 4, 6, 5, 8, 1, 4, 4, 3, 1, 6, 7, 5, 4, 4, 9, 5]

correlation_test = stats.pearsonr(x, y)

# Confidence interval:
ci_tmp = stats.norm.interval(0.95, 
  loc=0.5*math.log((1+correlation_test[0])/(1-correlation_test[0])), 
  scale=1/((len(x)-3)**0.5))
confidence_interval = (np.exp(2*np.array(ci_tmp))-1)/(np.exp(2*np.array(ci_tmp))+1)

print('Test:\n\n', correlation_test, '\n\n Confidence interval: \n\n', confidence_interval)

## Test:
## 
##  PearsonRResult(statistic=0.5166287822129882, pvalue=0.01968490591016886) 
## 
##  Confidence interval: 
## 
##  [0.09607185 0.78067292]

22.4 Zadania

Exercise 22.1 30 students attempted to guess the age of a certain person based on a photograph. Each student provided their estimate. Treat these 30 students as a random sample from a population we wish to make inferences about. Should we reject the null hypothesis stating that the distribution of numbers provided by the students as estimates of the person’s age on the photo follows a Gaussian distribution?

Here are the numbers provided by the 30 students: 35, 35, 44, 36, 25, 27, 37, 35, 35, 39, 25, 34, 28, 37, 29, 39, 40, 38, 30, 32, 30, 40, 34, 27, 36, 27, 29, 34, 19, 39.

Exercise 22.2 The following file contains data on the election results of the candidate from the ruling party (the incumbent president's party) and the state of the economy, measured by the weighted average annual growth rate of personal income during the four years preceding the election.

hibbs.csv

The following information is included in the subsequent columns:

year – the year of the election,
income_growth – the weighted average annual growth rate over the four-year period preceding the election,
vote_share – the percentage of votes cast for the candidate from the ruling party (the incumbent president’s party, Democrats or Republicans),
incumbent_party_cand – the name of the ruling party's candidate,
other_cand – the name of the other candidate.

The data comes from the book Gelman, Hill, and Vehtari (2021).

Treat the data as if it were independently generated from a random process (a random sample from the population). Can the null hypothesis of no correlation between income growth and the percentage of votes for the ruling party’s candidate be rejected?

Provide a confidence interval for the correlation coefficient.

Literature

Gelman, A., J. Hill, and A. Vehtari. 2021. Regression and Other Stories. Analytical Methods for Social Research. Cambridge University Press. https://books.google.pl/books?id=fILoDwAAQBAJ.