4.6 Prediction
Prediction in logistic regression focuses mainly on predicting the values of the logistic curve
\[
p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_kx_k)}}
\]
by means of
\[
\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\cdots+\hat\beta_kx_k)}}.
\]
From the perspective of the linear model, this is analogous to predicting the conditional mean of the response (not the conditional response itself), except that now this conditional mean is also a conditional probability. Predicting the conditional response is less interesting, since it follows immediately from \(\hat p(x_1,\ldots,x_k)\):
\[
\hat{Y}|(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.
\]
As a consequence, we predict \(Y\) as \(1\) if \(\hat p(x_1,\ldots,x_k)>\frac{1}{2}\) and as \(0\) if \(\hat p(x_1,\ldots,x_k)<\frac{1}{2}\).
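As a minimal illustration of the formula, \(\hat p\) can be evaluated directly from the estimated coefficients. The sketch below assumes `nasa` is the logistic fit on `temp` employed in the code later in this section; the value of `x` is illustrative.

```r
# Evaluate p_hat by hand from the coefficients of "nasa" (assumed to be the
# logistic regression on "temp" used below) at an illustrative temp value
beta <- coef(nasa)
x <- -0.6

# p_hat(x) = 1 / (1 + exp(-(beta0_hat + beta1_hat * x))); plogis() does the same
p_hat <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))

# Predict Y = 1 if p_hat > 1/2, Y = 0 otherwise
y_hat <- as.numeric(p_hat > 0.5)
```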
Let’s focus then on how to make predictions and compute CIs in practice with `predict`. Similarly to the linear model, the objects required for `predict` are: first, the output of `glm`; second, a `data.frame` containing the locations \(\mathbf{x}=(x_1,\ldots,x_k)\) where we want to predict \(p(x_1,\ldots,x_k)\). However, there are two differences with respect to the use of `predict` for `lm`:
- The argument `type`. The default, `type = "link"`, gives the predictions in the log-odds, that is, it returns \(\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}\). Setting `type = "response"` gives the predictions in the probability space \([0,1]\), that is, it returns \(\hat p(x_1,\ldots,x_k)\).
- There is no `interval` argument in `predict` for `glm`. That means that there is no easy way of computing CIs for prediction.
Since computing the CIs by yourself is a bit cumbersome, we can code the function `predictCIsLogistic` so that it computes them automatically, see below. The idea is simple: build a Wald CI in the log-odds scale, using the standard errors given by `se.fit = TRUE`, and then transform its limits to probabilities with the logistic function, which is monotone.
```r
# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)

# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
##        1 
## 7.833731

# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
##        1 
## 0.999604
```
```r
# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {

  # Compute predictions in the log-odds
  pred <- predict(object = object, newdata = newdata, se.fit = TRUE)

  # CI in the log-odds
  za <- qnorm(p = (1 - level) / 2)
  lwr <- pred$fit + za * pred$se.fit
  upr <- pred$fit - za * pred$se.fit

  # Transform to probabilities
  fit <- 1 / (1 + exp(-pred$fit))
  lwr <- 1 / (1 + exp(-lwr))
  upr <- 1 / (1 + exp(-upr))

  # Return a matrix with column names "fit", "lwr" and "upr"
  result <- cbind(fit, lwr, upr)
  colnames(result) <- c("fit", "lwr", "upr")
  return(result)

}

# Simple call
predictCIsLogistic(nasa, newdata = newdata)
##        fit       lwr       upr
## 1 0.999604 0.4838505 0.9999999

# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
```
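As a quick check of what `predictCIsLogistic` does internally, the same interval can be reproduced by hand: build the Wald CI in the log-odds scale from the `se.fit` standard errors and then map its limits with `plogis` (the logistic function). A minimal sketch, assuming the `nasa` fit and `newdata` from above:

```r
# Hand computation of the default 95% CI, assuming "nasa" and "newdata" as above
pred <- predict(nasa, newdata = newdata, se.fit = TRUE)
z <- qnorm(0.975)

# Wald CI in the log-odds scale
logodds_ci <- c(lwr = pred$fit - z * pred$se.fit, upr = pred$fit + z * pred$se.fit)

# plogis() is the logistic function, so this maps the CI into probabilities;
# it should match the "lwr" and "upr" columns given by predictCIsLogistic
plogis(logodds_ci)
```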
For the `challenger` dataset, do the following:

- Regress `fail.nozzle` on `temp` and `pres.nozzle`.
- Compute the predicted probability of `fail.nozzle = 1` for `temp` \(=15\) and `pres.nozzle` \(=200\). What is the predicted probability of `fail.nozzle = 0`?
- Compute the confidence interval for the two predicted probabilities at level \(95\%\).
Finally, Figure 4.9 gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to that of the CIs for the conditional mean in the simple linear model; see Section 2.6 and Figure 2.23.
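A static version of such bands can be sketched with `predictCIsLogistic` evaluated on a grid of predictor values. The code below assumes the `nasa` fit on `temp` from above; the grid range and plotting choices are illustrative only.

```r
# Static sketch of the CI bands for the conditional probability, assuming the
# "nasa" fit on "temp" from above (grid range and plot settings are illustrative)
temp_grid <- data.frame(temp = seq(-1, 30, length.out = 200))
ci <- predictCIsLogistic(nasa, newdata = temp_grid)

# Fitted logistic curve with pointwise lower/upper confidence limits
plot(temp_grid$temp, ci[, "fit"], type = "l", ylim = c(0, 1),
     xlab = "temp", ylab = "Conditional probability")
lines(temp_grid$temp, ci[, "lwr"], lty = 2)
lines(temp_grid$temp, ci[, "upr"], lty = 2)
```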