4.8 Model selection and multicollinearity
The discussion in Section 3.7 applies to logistic regression with small changes:
- The deviance of the model always decreases (and, correspondingly, the likelihood and the \(R^2\) always increase) with the inclusion of more predictors, no matter whether they are significant or not.
- An excess of predictors in the model is paid for with a larger variability in the estimation of the model, which results in less precise predictions.
- Multicollinearity may hide significant variables, flip the sign of their coefficients, and inflate the variability of the estimation.
The use of information criteria, stepwise
and vif
allows us to fight back these issues efficiently. Let's review them quickly from the perspective of logistic regression.
First, remember that the BIC/AIC information criteria are based on a balance between the fit of the model, given by its likelihood, and its complexity. In logistic regression, the BIC is
\[\begin{align*}
\text{BIC}(\text{model}) &= -2\log \text{lik}(\hat{\boldsymbol{\beta}}) + (k + 1)\times\log n\\
&=D+(k+1)\times \log n,
\end{align*}\]
where \(\text{lik}(\hat{\boldsymbol{\beta}})\) is the likelihood of the fitted model and \(D\) is its deviance. The AIC replaces \(\log n\) by \(2\), that is, \(\text{AIC}(\text{model}) = D + 2(k + 1)\), hence penalizing model complexity less. The BIC and AIC can be computed in R through the functions BIC
and AIC
, and we can check manually that they match their definitions.
# Models
nasa1 <- glm(fail.field ~ temp, family = "binomial", data = challenger)
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
             data = challenger)

# nasa1
summary1 <- summary(nasa1)
summary1
##
## Call:
## glm(formula = fail.field ~ temp, family = "binomial", data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0566 -0.7575 -0.3818 0.4571 2.2195
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.5837 3.9146 1.937 0.0527 .
## temp -0.4166 0.1940 -2.147 0.0318 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.335 on 21 degrees of freedom
## AIC: 24.335
##
## Number of Fisher Scoring iterations: 5
BIC(nasa1)
## [1] 26.60584
summary1$deviance + 2 * log(dim(challenger)[1])
## [1] 26.60584
AIC(nasa1)
## [1] 24.33485
summary1$deviance + 2 * 2
## [1] 24.33485
# nasa2
summary2 <- summary(nasa2)
summary2
##
## Call:
## glm(formula = fail.field ~ temp + pres.field, family = "binomial",
## data = challenger)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2109 -0.6081 -0.4292 0.3498 2.0913
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.642709 4.038547 1.645 0.1000
## temp -0.435032 0.197008 -2.208 0.0272 *
## pres.field 0.009376 0.008821 1.063 0.2878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 19.078 on 20 degrees of freedom
## AIC: 25.078
##
## Number of Fisher Scoring iterations: 5
BIC(nasa2)
## [1] 28.48469
summary2$deviance + 3 * log(dim(challenger)[1])
## [1] 28.48469
AIC(nasa2)
## [1] 25.07821
summary2$deviance + 3 * 2
## [1] 25.07821
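As a side note, BIC and AIC accept several fitted models at once, which gives a quick side-by-side comparison (output not shown; the values coincide with the ones above, and the smaller, the better):
# Compare both candidate models at once; lower BIC/AIC is preferred
BIC(nasa1, nasa2)
AIC(nasa1, nasa2)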
Second, stepwise
works analogously to the linear regression situation. Here is an illustration for a binary variable that measures whether a Boston suburb (Boston
dataset) is wealthy or not. The binary variable is medv > 25
: it is TRUE
(1
) for suburbs with a median house value larger than $25000 and FALSE
(0
) otherwise. The cutoff of $25000 corresponds, approximately, to the 25% richest suburbs.
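As a quick sanity check of that claim (not part of the original analysis), the proportion of suburbs above the cutoff can be computed directly and should be close to 0.25:
# Proportion of suburbs with medv > 25 -- should be roughly 0.25
data(Boston, package = "MASS")
mean(Boston$medv > 25)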
# Boston dataset
data(Boston)
# Model whether a suburb has a median house value larger than $25000
mod <- glm(I(medv > 25) ~ ., data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = I(medv > 25) ~ ., family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3498 -0.2806 -0.0932 -0.0006 3.3781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.312511 4.876070 1.090 0.275930
## crim -0.011101 0.045322 -0.245 0.806503
## zn 0.010917 0.010834 1.008 0.313626
## indus -0.110452 0.058740 -1.880 0.060060 .
## chas 0.966337 0.808960 1.195 0.232266
## nox -6.844521 4.483514 -1.527 0.126861
## rm 1.886872 0.452692 4.168 3.07e-05 ***
## age 0.003491 0.011133 0.314 0.753853
## dis -0.589016 0.164013 -3.591 0.000329 ***
## rad 0.318042 0.082623 3.849 0.000118 ***
## tax -0.010826 0.004036 -2.682 0.007314 **
## ptratio -0.353017 0.122259 -2.887 0.003884 **
## black -0.002264 0.003826 -0.592 0.554105
## lstat -0.367355 0.073020 -5.031 4.88e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 209.11 on 492 degrees of freedom
## AIC: 237.11
##
## Number of Fisher Scoring iterations: 7
r2Log(mod)
## [1] 0.628923
# With BIC - ends up with only the significant variables and a similar R^2
modBIC <- stepwise(mod, trace = 0)
##
## Direction: backward/forward
## Criterion: BIC
summary(modBIC)
##
## Call:
## glm(formula = I(medv > 25) ~ indus + rm + dis + rad + tax + ptratio +
## lstat, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3077 -0.2970 -0.0947 -0.0005 3.2552
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.556433 3.948818 0.394 0.693469
## indus -0.143236 0.054771 -2.615 0.008918 **
## rm 1.950496 0.441794 4.415 1.01e-05 ***
## dis -0.426830 0.111572 -3.826 0.000130 ***
## rad 0.301060 0.076542 3.933 8.38e-05 ***
## tax -0.010240 0.003631 -2.820 0.004800 **
## ptratio -0.404964 0.112086 -3.613 0.000303 ***
## lstat -0.384823 0.069121 -5.567 2.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 215.03 on 498 degrees of freedom
## AIC: 231.03
##
## Number of Fisher Scoring iterations: 7
r2Log(modBIC)
## [1] 0.6184273
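If stepwise is not available, a similar BIC-guided search can be sketched with base R's step by setting the penalty k to log(n). This is not the call employed above, but it should end up selecting a comparable submodel:
# BIC-based stepwise search with step(): k = log(n) gives the BIC penalty
n <- nrow(Boston)
modBIC2 <- step(mod, direction = "both", k = log(n), trace = 0)
formula(modBIC2)  # selected predictors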
Finally, multicollinearity can also be present in logistic regression. Despite the nonlinear logistic curve, the predictors are combined linearly in (4.4). Because of this, if two or more predictors are highly correlated with each other, the fit of the model will be compromised, since the individual linear effect of each predictor is hard to disentangle from those of the remaining correlated predictors.
In addition to inspecting the correlation matrix and looking for high correlations, a powerful tool to detect multicollinearity is the VIF of each coefficient \(\hat\beta_j\). The situation is exactly the same as in linear regression, since the VIF looks only into the linear relations among the predictors. Therefore, the rule of thumb is the same as in Section 3.8:
- VIF close to 1: absence of multicollinearity.
- VIF larger than 5 or 10: problematic amount of multicollinearity. It is advised to remove the predictor with the largest VIF.
Here is an example illustrating the use of VIF, through vif
, in practice. It also shows how the simple inspection of the correlation matrix is not enough for detecting collinearity in tricky situations.
# Create predictors with multicollinearity: x4 depends on the rest
set.seed(45678)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)
x3 <- 0.5 * x2 + rnorm(100)
x4 <- -x1 + x2 + rnorm(100, sd = 0.25)

# Response
z <- 1 + 0.5 * x1 + 2 * x2 - 3 * x3 - x4
y <- rbinom(n = 100, size = 1, prob = 1/(1 + exp(-z)))
data <- data.frame(x1 = x1, x2 = x2, x3 = x3, x4 = x4, y = y)
# Correlations - none seems suspicious
cor(data)
## x1 x2 x3 x4 y
## x1 1.0000000 0.38254782 0.2142011 -0.5261464 0.20198825
## x2 0.3825478 1.00000000 0.5167341 0.5673174 0.07456324
## x3 0.2142011 0.51673408 1.0000000 0.2500123 -0.49853746
## x4 -0.5261464 0.56731738 0.2500123 1.0000000 -0.11188657
## y 0.2019882 0.07456324 -0.4985375 -0.1118866 1.00000000
# Abnormal generalized variance inflation factors: largest for x4, we remove it
modMultiCo <- glm(y ~ x1 + x2 + x3 + x4, family = "binomial")
vif(modMultiCo)
## x1 x2 x3 x4
## 27.84756 36.66514 4.94499 36.78817
# Without x4
modClean <- glm(y ~ x1 + x2 + x3, family = "binomial")
# Comparison
summary(modMultiCo)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4743 -0.3796 0.1129 0.4052 2.3887
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2527 0.4008 3.125 0.00178 **
## x1 -3.4269 1.8225 -1.880 0.06007 .
## x2 6.9627 2.1937 3.174 0.00150 **
## x3 -4.3688 0.9312 -4.691 2.71e-06 ***
## x4 -5.0047 1.9440 -2.574 0.01004 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.81 on 99 degrees of freedom
## Residual deviance: 59.76 on 95 degrees of freedom
## AIC: 69.76
##
## Number of Fisher Scoring iterations: 7
summary(modClean)
##
## Call:
## glm(formula = y ~ x1 + x2 + x3, family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0952 -0.4144 0.1839 0.4762 2.5736
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9237 0.3221 2.868 0.004133 **
## x1 1.2803 0.4235 3.023 0.002502 **
## x2 1.7946 0.5290 3.392 0.000693 ***
## x3 -3.4838 0.7491 -4.651 3.31e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.813 on 99 degrees of freedom
## Residual deviance: 68.028 on 96 degrees of freedom
## AIC: 76.028
##
## Number of Fisher Scoring iterations: 6
r2Log(modMultiCo)
## [1] 0.5500437
r2Log(modClean)
## [1] 0.4877884
# Generalized variance inflation factors are now normal
vif(modClean)
## x1 x2 x3
## 1.674300 2.724351 3.743940
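To see that the VIF only involves the linear relations among the predictors (and not the response), it can be approximated by hand with the classical linear-model formula \(1 / (1 - R^2_j)\), where \(R^2_j\) comes from regressing the \(j\)-th predictor on the rest. Here is a sketch for x4; the figure need not match exactly the generalized VIF reported above for the glm fit, but it points to the same culprit:
# VIF of x4 "by hand": regress x4 on the other predictors and use 1 / (1 - R^2)
R2.x4 <- summary(lm(x4 ~ x1 + x2 + x3, data = data))$r.squared
1 / (1 - R2.x4)  # large value: x4 is (almost) a linear combination of the rest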
For the Boston
dataset, do the following:
- Compute the hit matrix and hit ratio for the regression I(medv > 25) ~ . (hint: do table(medv > 25, ...); see the sketch after this list).
- Fit I(medv > 25) ~ . but now using only the first 300 observations of Boston, the training dataset (hint: use subset).
- For the previous model, predict the probability of the responses and classify them into 0 or 1 in the last 206 observations, the testing dataset (hint: use predict on that subset).
- Compute the hit matrix and hit ratio for the new predictions. Check that the hit ratio is smaller than the one in the first point. The hit ratio on the testing dataset, and not the first hit ratio, is an estimator of how well the model is going to classify future observations.
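A possible skeleton for the first item, assuming the full model mod fitted above and a 0.5 classification cutoff, is the following (the remaining items are left as practice):
# Hypothetical skeleton for the first item
probs <- predict(mod, type = "response")        # fitted probabilities
hitMat <- table(Boston$medv > 25, probs > 0.5)  # hit matrix: reality vs. prediction
hitMat
sum(diag(hitMat)) / sum(hitMat)                 # hit ratio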