Chapter 3 Week 5
3.1 Sample Size Calculation (Optional)
3.1.1 Method 1: power.prop.test
- Available in R base (no additional packages needed).
power.prop.test(n = NULL,
p1 = NULL, p2 = NULL,
sig.level = 0.05,
power = NULL,
alternative = c("two.sided", "one.sided"))
The input for the function are:
n
: number of observations (per group)p1
&p2
: probability in two groupssig.level
: significance level (type I error probability), default value is 0.05 (\(\alpha = 0.05\))power
: power of test (1 - Type II error probability, \(1-\beta\))alternative
: one- or two-sided test.
Your input can be four of five parameters in
n
,p1
,p2
,sig.level
andpower
, then the output would include the calculated “missing” parameter.Determine parameters to obtain a target power, or compute the power of the two-sample test for proportions.
Say you want to know what sample size you need for an experiment in which you’re seeking to determine whether or not the difference in two proportions of success is statistically significant with target power.
Look at historical data to establish baseline predictions. Say that in the past, \(P(D \mid \bar{E}) = 0.10\) and \(P(D \mid E)=0.17\). Assume that these probabilities are based on relatively large amounts of data.
# determine parameters to obtain a target power
<- power.prop.test(
n.out p1 = 0.17, p2 = 0.10,
sig.level = 0.05,
power = 0.8,
alternative = "two.sided")
n.out
##
## Two-sample comparison of proportions power calculation
##
## n = 372.9221
## p1 = 0.17
## p2 = 0.1
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: n is number in *each* group
NOTE: n is number in each group
- Thus, you need around \(n/2 \ge 373\) in each arm, and a total of \(n \ge 746\).
Exercise:
Change the value of sig.level, power, p1 and p2 to explore how sample size are affected by these parameters.
Say you decided to draw a sample with 375 observations in each arm, how much power can you acheive in this test?
# Compute the power of the two-sample test for proportions
<- power.prop.test(
power.out n=375,
p1=0.17, p2 = 0.10,
sig.level = 0.05, alternative = "two.sided")
power.out
##
## Two-sample comparison of proportions power calculation
##
## n = 375
## p1 = 0.17
## p2 = 0.1
## sig.level = 0.05
## power = 0.8021829
## alternative = two.sided
##
## NOTE: n is number in *each* group
- Thus, you can achieve 0.80 power in this test.
3.1.2 Method 2: Formula and qnorm
\[n > \frac{2*(Z_{1-\alpha/2}+Z_{1-\beta})^2*[p_1*(1-p_1)+p_2*(1-p_2)]}{(p_1-p_2)^2}\]
- To get the Z-value
corresponding to a given probability in the formula, we use qnorm
:
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE)
- The input for the function are:
p
: (vector of) probability(ies).mean
andsd
: (vector of) mean(s) and standard deviation(s).lower.tail
: logical; if TRUE (default), probabilities are \(P[X \le x]\), otherwise, \(P[X > x]\).
# Z-scores
= qnorm(1-0.05/2)
zalpha = qnorm(0.8)
zpower c(zalpha,zpower)
## [1] 1.9599640 0.8416212
= 0.17
p1 = 0.10
p2 <- (2*((zalpha+zpower)^2)*(p1*(1-p1)+p2*(1-p2)))/((p1-p2)^2)
n n
## [1] 740.3576
NOTE: n is number in total
- Thus, you need a total of \(n \ge 741\).
3.2 epitable
- Create r x c contigency table for r exposure levels and c outcome levels
epitable(..., ncol =2, byrow = TRUE,
rev = c("neither", "rows", "columns", "both"))
...
:- four or more integers that be converted into r x 2 table (the number of integers must be even), e.g.
epitable(88, 20, 555, 347)
- four or more integers that be converted into r x 2 table (the number of integers must be even), e.g.
- two categorical vectors (1st vector is exposure with r levels, 2nd vector is outcome with 2 levels), e.g.
epitable("Exposure"=exposure, "Disease"=outcome)
, orepitable(dat$expo, dat$out)
- two categorical vectors (1st vector is exposure with r levels, 2nd vector is outcome with 2 levels), e.g.
- r x 2 contingency table, e.g.
epitable(matrix(1:6, 3, 2))
- r x 2 contingency table, e.g.
- single vector that be converted into r x 2 table (the number of integers must be even), e.g.
epitable(c(88, 20, 555, 347))
- single vector that be converted into r x 2 table (the number of integers must be even), e.g.
ncol
: Defaultncol=2
, number of columns when a table is constructed from a vector or sequence of numbersbyrow
: Defaultbyrow=TRUE
and single vector or collection of numbers is read in row-wise.- Set to FALSE to read in column-wise.
- e.g.
epitable(88, 20, 555, 347)
, by default 88, 20 would be regarded as the first row (under the same exposure status, Diseased and No disease respectively), and 555, 347 the second under the other exposure status. - e.g.
epitable(88, 20, 555, 347, byrow = FALSE)
, then 88, 20 would be regarded as the first column (under the same disease status, Exposed and Unexposed respectively), and 555, 347 the second column under the other disease status.
rev
: Defaultrev="neither"
means no order reversal in the process, can be“rows,” “colums” or “both” to reverse the orders.- Usually the table we get from (2) two categorical vectors would be like:
<- read.dta("_data/diet.dta")
dat epitable(dat$diet, dat$mort)
## Outcome
## Predictor 0 1
## 0 186 136
## 1 119 59
- In the output table, the outcomes and predictors (exposure) are listed by 0 and then 1; But in the 2 by 2 tables we usually used for further calculation, the top left cell should be Exposed and Disease (Predictor = 1 & Outcome = 1)
Disease + | Disease - | |
---|---|---|
Expose + | ||
Expose - |
- So `rev="both"` is added:
epitable(dat$diet, dat$mort, rev="both")
## Outcome
## Predictor 1 0
## 1 59 119
## 0 136 186
- The contingency table created by this function is usually used for additional analyses, for example, the
epitab
function.
3.3 epitab
- Calculates risks, risk ratio, odds ratio, and confidence intervals for epidemiologic data
epitab(x, y = NULL,
method = c("oddsratio", "riskratio", "rateratio"),
conf.level = 0.95,
rev = c("neither", "rows", "columns", "both"),
oddsratio = c("wald", "fisher", "midp", "small"),
riskratio = c("wald", "boot", "small"),
rateratio = c("wald", "midp"),
pvalue = c("fisher.exact", "midp.exact", "chi2"),
correction = FALSE,
verbose = FALSE)
- Input data:
odds.ratio.or.risk.ratio | rate.ratio | |
---|---|---|
x = r x 2 table | \[\checkmark\] | \[\checkmark\] (1stcol: disease counts; 2nd col: person time at risk) |
x = numeric vectors | \[\checkmark\] (transformed into r x 2 table in row-wise order) | \[\checkmark\] (counts followed by person time at risk) |
x and y | \[\checkmark\] (both are single factor or character vectors) | \[\checkmark\] (x = single numeric vector of counts, y = numeric vector of corresponding person time at risk) |
Other arguments and default values
measure of association:
method = "oddsratio"
confidence level:
conf.level = 0.05
estimation methods: The default of
oddsratio
,riskratio
andrateratio
are"wald"
pvalue = "fisher.exact"
correction = FALSE
means no Yate’s continuity correction.verbose = FALSE
, set toTRUE
to return more detailed results
The function
epitab
requires:
Disease - | Disease + | |
---|---|---|
Expose - | ||
Expose + |
<- epitab(epitable(dat$diet, dat$mort))
stat stat
## $tab
## Outcome
## Predictor 0 p0 1 p1 oddsratio lower upper p.value
## 0 186 0.6098361 136 0.6974359 1.0000000 NA NA NA
## 1 119 0.3901639 59 0.3025641 0.6780771 0.462563 0.9940021 0.05535248
##
## $measure
## [1] "wald"
##
## $conf.level
## [1] 0.95
##
## $pvalue
## [1] "fisher.exact"
- Outputs:
$tab stat
## Outcome
## Predictor 0 p0 1 p1 oddsratio lower upper p.value
## 0 186 0.6098361 136 0.6974359 1.0000000 NA NA NA
## 1 119 0.3901639 59 0.3025641 0.6780771 0.462563 0.9940021 0.05535248
- The epidemiologic tabulation generated.
stat$measure
[1] "wald"
- Wald (i.e. Normal approximation) is used to generate the CIs
$statconf.level
[1] 0.95
- The confidence level is set to 95%
stat$pvalue
[1] "fisher.exact"
- The Fisher exact test was used to generate the p-value.