17 Hypothesis testing — two proportions

17.1 Two proportions test for large samples

In the two proportions test, the null hypothesis states that the difference between the proportions in the two populations from which the samples are drawn is zero⁸:

\[ H_0: p_1-p_2 = 0,\] where \(p_1\) is the proportion in the first population and \(p_2\) is the proportion in the second population.

As in the tests discussed earlier, there are three options for the alternative hypothesis. The two-tailed test:

\[H_A: p_1-p_2 \ne 0\]

the left-tailed test:

\[H_A: p_1-p_2 < 0 \]

and the right-tailed test:

\[H_A: p_1-p_2 > 0 \]

Equivalently, the alternative hypotheses can be written as \(H_A:p_1\ne p_2\), \(H_A:p_1<p_2\), and \(H_A:p_1>p_2\), and the null hypothesis as \(H_0: p_1=p_2\).

The test statistic \(z\) is:

\[z=\frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}\hat{q} \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}},\] where \(\hat{p}_1\) and \(\hat{p}_2\) are the proportions in samples 1 and 2, and \(\hat{p}\) is the pooled proportion in these samples (with \(x_i\) denoting the number of favourable observations in sample \(i\) and \(n_i\) its size):

\[\hat{p}_1=\frac{x_1}{n_1}, \:\: \hat{p}_2=\frac{x_2}{n_2}, \:\: \hat{p}=\frac{x_1+x_2}{n_1+n_2}, \:\: \hat{q}=1-\hat{p}\]
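
As a minimal sketch of the formula above, the statistic can be computed by hand in Python. The counts (21 of 24 and 14 of 24) are borrowed from the template code later in this chapter, and the function name `two_prop_z` is only illustrative.

```python
import math


def two_prop_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 - p2 = 0 (illustrative helper)."""
    phat1 = x1 / n1
    phat2 = x2 / n2
    # Pooled proportion: all favourable observations over all observations
    phat = (x1 + x2) / (n1 + n2)
    qhat = 1 - phat
    # Standard error of the difference under the null hypothesis
    se = math.sqrt(phat * qhat * (1 / n1 + 1 / n2))
    return (phat1 - phat2) / se


z = two_prop_z(21, 24, 14, 24)
print(round(z, 4))
```

The result should agree with the Z test statistic reported by the R and Python templates in section 17.4 (about 2.27).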

The rejection region is chosen as in other \(z\) tests.

Note that the test assumes the samples are drawn independently from two populations (or processes). In addition, both samples must be large enough for the normal approximation to apply. In practice, the samples can be considered large enough if \(n_1 \hat{p}_1\geq 15\), \(n_1 \hat{q}_1\geq 15\), \(n_2 \hat{p}_2\geq 15\), and \(n_2 \hat{q}_2\geq 15\), where \(\hat{q}_i=1-\hat{p}_i\). Sometimes a minimum of \(5\) is used instead of \(15\).
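
The sample-size rule can be checked mechanically. The sketch below is one possible way to do it; the helper name `large_enough` is hypothetical. Because \(n_i\hat{p}_i = x_i\) and \(n_i\hat{q}_i = n_i - x_i\), the check works directly on counts.

```python
def large_enough(x, n, threshold=15):
    """Check n*phat >= threshold and n*qhat >= threshold for one sample.

    x is the number of favourable observations and n the sample size,
    so n*phat equals x and n*qhat equals n - x.
    """
    return x >= threshold and (n - x) >= threshold


# Counts implied by Exercise 17.1: 4.9% and 6.9% of 1000 observations
print(large_enough(49, 1000), large_enough(69, 1000))

# Counts from the chapter's template code: 21 of 24 and 14 of 24
print(large_enough(21, 24), large_enough(14, 24, threshold=5))
```

In the second example the first sample has only 3 unfavourable observations, so it fails the rule even with the weaker threshold of 5. The normal approximation should then be treated with caution.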

17.2 Formulas

Difference between proportions in two populations – confidence interval:

\[\begin{equation} (\hat{p}_1-\hat{p}_2)\pm z_{\alpha/2}{\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1}+\frac{\hat{p}_2\hat{q}_2}{n_2}}} \tag{17.1} \end{equation}\]
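
Formula (17.1) might be implemented as follows. This is a sketch using only the Python standard library; the function name `two_prop_ci` and the default confidence level of 0.95 are assumptions made for illustration, and the counts are again those from the template code.

```python
import math
from statistics import NormalDist


def two_prop_ci(x1, n1, x2, n2, conf_level=0.95):
    """Confidence interval for p1 - p2 based on formula (17.1)."""
    phat1, phat2 = x1 / n1, x2 / n2
    # Unpooled standard error: each sample contributes its own variance
    se = math.sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
    alpha = 1 - conf_level
    # Critical value z_{alpha/2} of the standard normal distribution
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = phat1 - phat2
    return diff - z_crit * se, diff + z_crit * se


lower, upper = two_prop_ci(21, 24, 14, 24)
print(round(lower, 3), round(upper, 3))
```

Unlike the test statistic, the interval does not use the pooled proportion, because it does not assume that \(p_1 = p_2\).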

Difference between proportions in two populations – test:

\[\begin{equation} \begin{split} z=\frac{(\hat{p}_1-\hat{p}_2)-0}{\sqrt{\hat{p}\hat{q} \left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \\ {}\hat{p}_1=\frac{x_1}{n_1}, \:\: \hat{p}_2=\frac{x_2}{n_2}, \:\: \hat{p}=\frac{x_1+x_2}{n_1+n_2}, \:\: \hat{q}=1-\hat{p} \end{split} \tag{17.2} \end{equation}\]

17.3 Effect size

When comparing two proportions, the effect size can be measured in several ways. We can use:

the (absolute) difference between proportions: \(|\hat{p}_1-\hat{p}_2|\),
the ratio of proportions: \(\hat{p}_1/\hat{p}_2\) or \(\hat{p}_2/\hat{p}_1\),
the odds ratio: \(\hat{o}_1/\hat{o}_2\) or \(\hat{o}_2/\hat{o}_1\), where \(\hat{o}_i = \hat{p}_i/(1-\hat{p}_i)\).

The last option (odds ratio) is perhaps the most commonly used in the literature, particularly in medical literature.
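
The three measures can be computed side by side. The sketch below uses the counts from the template code (21 of 24 and 14 of 24); the function name `effect_sizes` and the dictionary keys are illustrative choices, not part of any library.

```python
def effect_sizes(x1, n1, x2, n2):
    """Three effect-size measures for two sample proportions.

    Assumes both proportions lie strictly between 0 and 1;
    otherwise the ratio or the odds are undefined (division by zero).
    """
    phat1, phat2 = x1 / n1, x2 / n2
    odds1 = phat1 / (1 - phat1)
    odds2 = phat2 / (1 - phat2)
    return {
        "difference": abs(phat1 - phat2),
        "ratio": phat1 / phat2,
        "odds_ratio": odds1 / odds2,
    }


for name, value in effect_sizes(21, 24, 14, 24).items():
    print(name, round(value, 3))
```

For these counts the difference is about 0.29, the ratio of proportions is 1.5, and the odds ratio is 5 (odds of 7 against odds of 1.4). The same data can therefore look quite different depending on the measure chosen.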

17.4 Templates

Spreadsheets

Tests and confidence intervals – 2 proportions — Google spreadsheet

Tests and confidence intervals – 2 proportions — Excel template

R code

# Test for 2 proportions
# Sample 1 size:
n1 <- 24
# Number of favourable observations in sample 1:
x1 <- 21
# Sample 1 proportion:
phat1 <- x1/n1
# Sample 2 size:
n2 <- 24
# Number of favourable observations in sample 2:
x2 <- 14
# Sample 2 proportion:
phat2 <- x2/n2


# Significance level:
alpha <- 0.05
# Alternative (sign): "<"; ">"; "<>"/"≠"
alt <- ">"

alttext <- if(alt==">") {"greater"} else if(alt=="<") {"less"} else {"two.sided"}

test <- prop.test(c(x1, x2), c(n1, n2), alternative=alttext, correct=FALSE)
test_z <- unname(-sign(diff(test$estimate))*sqrt(test$statistic))
crit_z <- if(test$alternative=="less") {qnorm(alpha)} else if(test$alternative=="greater") {qnorm(1-alpha)} else {qnorm(1-alpha/2)}

print(c('Sample proportions ' = test$estimate, 
        'Sample sizes ' = c(n1, n2),
        'Null hypothesis' = paste0('p1-p2 = ', 0),
        'Alt. hypothesis' = paste0('p1-p2 ', alt, ' ', 0),
        'Z test statistic' = test_z,
        'Chi^2 test statistic' = unname(test$statistic),
        'Z test critical value' = crit_z,
        'Chi^2 test critical value' = crit_z^2,
        'P-value' = test$p.value
))

## Sample proportions .prop 1 Sample proportions .prop 2             Sample sizes 1             Sample sizes 2 
##                    "0.875"        "0.583333333333333"                       "24"                       "24" 
##            Null hypothesis            Alt. hypothesis           Z test statistic       Chi^2 test statistic 
##                "p1-p2 = 0"                "p1-p2 > 0"         "2.27359424023522"         "5.16923076923077" 
##      Z test critical value  Chi^2 test critical value                    P-value 
##         "1.64485362695147"         "2.70554345409541"       "0.0114951970462325"

Python code

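import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Sample 1 size and number of favourable observations:
n1 = 24
x1 = 21
# Sample 1 proportion:
phat1 = x1 / n1

# Sample 2 size and number of favourable observations:
n2 = 24
x2 = 14
# Sample 2 proportion:
phat2 = x2 / n2

# Significance level:
alpha = 0.05
# Alternative (sign): "<"; ">"; "<>"/"≠"
alt = ">"

# Map the sign to the name expected by proportions_ztest
if alt == ">":
    alttext = "larger"
elif alt == "<":
    alttext = "smaller"
else:
    alttext = "two-sided"

# Pooled two-sample z test for H0: p1 - p2 = 0
test_result = proportions_ztest(count=np.array([x1, x2]), nobs=np.array([n1, n2]), alternative=alttext)

print("Z test statistic:", test_result[0],
      "\np-value:", test_result[1])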

## Z test statistic: 2.2735942402352203 
## p-value: 0.011495197046232447
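
The R template above also reports the critical value, which the Python template does not. A possible way to add the same decision rule, using only the standard library, is sketched below; the helper names `z_critical` and `reject_h0` are hypothetical.

```python
from statistics import NormalDist


def z_critical(alpha, alt):
    """Critical value of the z test for alt in {"<", ">", "<>"}."""
    std_normal = NormalDist()
    if alt == "<":
        return std_normal.inv_cdf(alpha)
    if alt == ">":
        return std_normal.inv_cdf(1 - alpha)
    # Two-tailed test: alpha is split between both tails
    return std_normal.inv_cdf(1 - alpha / 2)


def reject_h0(z, alpha, alt):
    """True if the statistic z falls in the rejection region."""
    crit = z_critical(alpha, alt)
    if alt == "<":
        return z < crit
    if alt == ">":
        return z > crit
    return abs(z) > crit


z = 2.2735942402352203  # Z test statistic reported above
print(z_critical(0.05, ">"), reject_h0(z, 0.05, ">"))
```

For the right-tailed test at \(\alpha = 0.05\), the critical value is about 1.645, so the statistic of about 2.27 falls in the rejection region. This agrees with the p-value being below 0.05.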

17.5 Exercises

Exercise 17.1 (Aczel and Sounderpandian 2018) In 2008-2009, non-performing mortgage loans were a problem in the American economy. According to USA Today, the percentage of homeowners with mortgages who did not make timely payments was 4.9% in some regions in the West and 6.9% in the South. Knowing that these figures were calculated from two independent random samples, each of size 1000, test the equality of the proportions of unreliable borrowers. Assume α = 0.05.

Exercise 17.2 In 1972, 48 bank managers were given the same personnel files. Each was asked to assess whether a particular person should be promoted to branch manager or whether other candidates should be considered. The files were identical, except that in half the cases, it was stated that the person was a woman, and in half, that the person was a man. Of the 24 "male" cases, a promotion was recommended in 21, and of the 24 "female" cases, 14 were recommended for promotion (Rosen and Jerdee 1974). Is this convincing evidence that the managers discriminated against female applicants? Or can the difference in the numbers recommended for promotion be confidently attributed to chance?

Exercise 17.3 French sociologists study the factors that may influence tipping behaviour and the amount of tips given (see Guéguen and Jacob 2005; Guéguen 2012). It often turns out that the frequency of tipping varies with the gender of the customer (and also of the waiter). In the study by Guéguen and Jacob (2005), a female bartender received a tip from 21 out of 97 men and from 4 out of 46 women she served. Is the difference statistically significant (α = 0.05)?

In the study by Guéguen (2012), the behaviour of 503 male customers (217 tips) and 344 female customers (104 tips) at a certain restaurant was examined. Is the difference statistically significant in this case?

Is the effect size similar in these two studies?

Exercise 17.4 In the study by Guéguen (2012), the association between tipping and the colour of the waitress's hair (more precisely, her wig) was examined at a restaurant. When the waitress wore a blonde wig, 73 out of 130 men tipped, while for other colours, 217 out of 503 men tipped. For women, the figures were as follows: blonde waitress – 25/90, non-blonde waitress – 79/254. Was the association between tipping and hair colour statistically significant for both genders of customers?

Exercise 17.5 A horse known as "Clever Hans" was said to be able to find the correct answer to a mathematical question and tap the correct number with its hoof. In a study conducted in 1911, it was found that Hans was able to give the correct answer in 89% of cases (50/56) when he could see the questioner, and only in 6% of cases (2/35) when he could not (Pfungst 2012). The questioner knew the answers to the questions. These results showed that Hans was not clever because he knew mathematics, but because he could read the body language of the questioner. Check the statistical significance of the study.

Literature

Aczel, A. D., and J. Sounderpandian. 2018. Statystyka w Zarządzaniu. PWN. https://ksiegarnia.pwn.pl/Statystyka-w-zarzadzaniu,731934758,p.html.

Guéguen, Nicolas. 2012. “Hair Color and Wages: Waitresses with Blond Hair Have More Fun.” The Journal of Socio-Economics 41 (4): 370–72. https://doi.org/10.1016/j.socec.2012.04.012.

Guéguen, Nicolas, and Céline Jacob. 2005. “The Effect of Touch on Tipping: An Evaluation in a French Bar.” International Journal of Hospitality Management 24 (2): 295–99. https://doi.org/10.1016/j.ijhm.2004.06.004.

Pfungst, Oskar. 2012. Clever Hans (the Horse of Mr. Von Osten): A Contribution to Experimental Animal and Human Psychology.

Rosen, Benson, and Thomas H. Jerdee. 1974. “Influence of Sex Role Stereotypes on Personnel Decisions.” Journal of Applied Psychology 59: 9–14. https://doi.org/10.1037/h0035834.

⁸ There are also tests for a hypothesized difference \(D_0\) other than zero (that is, \(H_0: p_1-p_2 = D_0\)), but we will not discuss them here.