B Use of qualitative predictors in regression
An important situation not covered in Chapters 2, 3 and 4 is how to deal with qualitative, rather than quantitative, predictors. Qualitative predictors, also known as categorical variables or, in R's terminology, factors, are ubiquitous in the social sciences. Dealing with them requires some care and a proper understanding of how these variables are represented in statistical software such as R.
Two levels
The simplest case is the situation with two levels, that is, the binary case covered in logistic regression. There we saw that a binary variable \(C\) with two levels (for example, a and b) could be represented as \[ D=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C=a.\end{array}\right. \] \(D\) is now a dummy variable: it codifies the two possible levels of the categorical variable with zeros and ones. An example of \(C\) could be gender, which has levels male and female. The associated dummy variable is \(D=0\) if the gender is male and \(D=1\) if the gender is female.
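As a quick illustration, a minimal sketch of this coding in R (the data is made up):
# Hypothetical two-level variable and its dummy version
C <- factor(c("a", "b", "b", "a"))
D <- as.integer(C == "b")  # 1 if C is b, 0 if C is a
D
## [1] 0 1 1 0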
The advantage of this dummification is its interpretability in regression models. Since level a corresponds to \(0\), it can be seen as the reference level against which level b is compared. This is the key point of dummification: set one level as the reference and codify the remaining levels as departures from it, marked with ones.
The previous interpretation translates easily to regression models. Assume that the dummy variable \(D\) is available together with other predictors \(X_1,\ldots,X_k\). Then:
Linear model \[ \mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k,D=d]=\beta_0+\beta_1x_1+\cdots+\beta_kx_k+\beta_{k+1}d. \] The coefficient associated with \(D\) is easily interpretable: \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated with changing \(D=0\) (reference) to \(D=1\), while the rest of the predictors are fixed. In other words, \(\beta_{k+1}\) is the increment in the mean of \(Y\) associated with changing the level of the categorical variable from a to b.
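Indeed, subtracting the two conditional means with \(x_1,\ldots,x_k\) fixed gives \[ \mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k,D=1]-\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k,D=0]=\beta_{k+1}. \]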
Logistic model \[ \mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k,D=d]=\text{logistic}(\beta_0+\beta_1x_1+\cdots+\beta_kx_k+\beta_{k+1}d). \] We have two interpretations of \(\beta_{k+1}\), either in terms of log-odds or odds (a short derivation follows the list):
- \(\beta_{k+1}\) is the additive increment in the log-odds of \(Y\) associated with changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
- \(e^{\beta_{k+1}}\) is the multiplicative increment in the odds of \(Y\) associated with changing the level of the categorical variable from a (reference, \(D=0\)) to b (\(D=1\)).
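Both interpretations follow from subtracting the log-odds at \(D=1\) and \(D=0\) with the remaining predictors fixed: \[ \log\frac{\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k,D=1]}{\mathbb{P}[Y=0|X_1=x_1,\ldots,X_k=x_k,D=1]}-\log\frac{\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k,D=0]}{\mathbb{P}[Y=0|X_1=x_1,\ldots,X_k=x_k,D=0]}=\beta_{k+1}. \] Exponentiating both sides yields the odds ratio \(e^{\beta_{k+1}}\).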
R does the dummification automatically (it translates a categorical variable \(C\) into its dummy version \(D\)) if it detects that a factor variable is present in the regression model. Let's see examples of this in linear and logistic regression.
# Load the Boston dataset
library(MASS)
data(Boston)
# Structure of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
# chas is a dummy variable measuring if the suburb is close to the river (1)
# or not (0). In this case it is not codified as a factor but as a 0 or 1.
# Summary of a linear model
mod <- lm(medv ~ chas + crim, data = Boston)
summary(mod)
##
## Call:
## lm(formula = medv ~ chas + crim, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.540 -5.421 -1.878 2.575 30.134
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.61403 0.41862 56.409 < 2e-16 ***
## chas 5.57772 1.46926 3.796 0.000165 ***
## crim -0.40598 0.04339 -9.358 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.373 on 503 degrees of freedom
## Multiple R-squared: 0.1744, Adjusted R-squared: 0.1712
## F-statistic: 53.14 on 2 and 503 DF, p-value: < 2.2e-16
# The coefficient associated with chas is 5.57772. That means that if the
# suburb is close to the river, the mean of medv increases by 5.57772 units.
# chas is significant (the presence of the river adds valuable information
# for explaining medv)
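# A quick check of this interpretation (illustrative sketch): predict medv for
# two suburbs that only differ in chas, with crim fixed at its sample mean
newdata <- data.frame(chas = c(0, 1), crim = mean(Boston$crim))
diff(predict(mod, newdata = newdata))  # equals the chas coefficient, 5.57772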
# Create a binary response (1 expensive suburb, 0 inexpensive)
Boston$expensive <- Boston$medv > 25
# Summary of a logistic model
mod <- glm(expensive ~ chas + crim, data = Boston, family = "binomial")
summary(mod)
##
## Call:
## glm(formula = expensive ~ chas + crim, family = "binomial", data = Boston)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.26764 -0.84292 -0.67854 -0.00099 2.87470
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.82159 0.12217 -6.725 1.76e-11 ***
## chas 1.04165 0.36962 2.818 0.00483 **
## crim -0.22816 0.05265 -4.333 1.47e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 563.52 on 505 degrees of freedom
## Residual deviance: 513.44 on 503 degrees of freedom
## AIC: 519.44
##
## Number of Fisher Scoring iterations: 6
# The coefficient associated with chas is 1.04165. That means that if the
# suburb is close to the river, the log-odds of expensive increases by 1.04165.
# Alternatively, the odds of expensive increase by a factor of exp(1.04165).
# chas is significant (the presence of the river adds valuable information
# for explaining expensive)
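# The odds-ratio interpretation can be checked directly (illustrative sketch):
exp(coef(mod)["chas"])  # approximately 2.83, the multiplicative change in odds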
More than two levels
Let’s see now the case with more than two levels, for example, a categorical variable \(C\) with levels a, b and c. If we take a as the reference level, this variable can be represented by two dummy variables: \[ D_1=\left\{\begin{array}{ll}1,&\text{if }C=b,\\0,& \text{if }C\neq b\end{array}\right. \] and \[ D_2=\left\{\begin{array}{ll}1,&\text{if }C=c,\\0,& \text{if }C\neq c.\end{array}\right. \] Then \(C=a\) is represented by \(D_1=D_2=0\), \(C=b\) is represented by \(D_1=1,D_2=0\) and \(C=c\) is represented by \(D_1=0,D_2=1\). The interpretation of the regression models in the presence of \(D_1\) and \(D_2\) is very similar to the previous one. For example, in the linear model, the coefficient associated with \(D_1\) gives the increment in the mean of \(Y\) when the category of \(C\) changes from a to b. The coefficient of \(D_2\) gives the increment in the mean of \(Y\) when it changes from a to c.
In general, if we have a categorical variable with \(J\) levels, then the number of dummy variables required is \(J-1\). Again, R does the dummification automatically for you if it detects that a factor variable is present in the regression model.
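The dummy variables that R creates can be inspected in the design matrix. A minimal sketch with a made-up three-level factor:
# Hypothetical three-level factor; model.matrix shows its dummy coding
C <- factor(c("a", "b", "c", "b"))
model.matrix(~ C)
# The columns Cb and Cc are the dummies D1 and D2; the reference level a
# corresponds to both being zero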
# Load dataset - factors in the last column
data(iris)
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Summary of a linear model
mod1 <- lm(Sepal.Length ~ ., data = iris)
summary(mod1)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.17127 0.27979 7.760 1.43e-12 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciesversicolor -0.72356 0.24017 -3.013 0.00306 **
## Speciesvirginica -1.02350 0.33373 -3.067 0.00258 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciesversicolor (D1) coefficient: -0.72356. The average increment of
# Sepal.Length when the species is versicolor instead of setosa (reference).
# Speciesvirginica (D2) coefficient: -1.02350. The average increment of
# Sepal.Length when the species is virginica instead of setosa (reference).
# Both dummy variables are significant
# How to set a different level as reference (versicolor)
iris$Species <- relevel(iris$Species, ref = "versicolor")
# Same estimates except for the dummy coefficients
mod2 <- lm(Sepal.Length ~ ., data = iris)
summary(mod2)
##
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.79424 -0.21874 0.00899 0.20255 0.73103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.44770 0.28149 5.143 8.68e-07 ***
## Sepal.Width 0.49589 0.08607 5.761 4.87e-08 ***
## Petal.Length 0.82924 0.06853 12.101 < 2e-16 ***
## Petal.Width -0.31516 0.15120 -2.084 0.03889 *
## Speciessetosa 0.72356 0.24017 3.013 0.00306 **
## Speciesvirginica -0.29994 0.11898 -2.521 0.01280 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3068 on 144 degrees of freedom
## Multiple R-squared: 0.8673, Adjusted R-squared: 0.8627
## F-statistic: 188.3 on 5 and 144 DF, p-value: < 2.2e-16
# Speciessetosa (D1) coefficient: 0.72356. The average increment of
# Sepal.Length when the species is setosa instead of versicolor (reference).
# Speciesvirginica (D2) coefficient: -0.29994. The average increment of
# Sepal.Length when the species is virginica instead of versicolor (reference).
# Both dummy variables are significant
# Confidence intervals for the coefficients of the model
confint(mod2)
## 2.5 % 97.5 %
## (Intercept) 0.8913266 2.00408209
## Sepal.Width 0.3257653 0.66601260
## Petal.Length 0.6937939 0.96469395
## Petal.Width -0.6140049 -0.01630542
## Speciessetosa 0.2488500 1.19827390
## Speciesvirginica -0.5351144 -0.06475727
# The confidence interval of Speciessetosa is entirely positive and that of
# Speciesvirginica is entirely negative. Therefore, both coefficients are
# significantly different from zero: there are significant differences in the
# mean of Sepal.Length between versicolor (reference) and the other two species
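As a sanity check (assuming mod1 and mod2 are the fits above), releveling changes the parametrization but not the model itself, so both fits have identical fitted values:
# Releveling does not change the fitted model, only its parametrization
all.equal(fitted(mod1), fitted(mod2))
## [1] TRUE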
Do not codify a categorical variable as a discrete variable. This constitutes a major methodological error that will flaw the subsequent statistical analysis. For example, if you have a categorical variable party with levels partyA, partyB and partyC, do not encode it as a discrete variable taking the values 1, 2 and 3, respectively. If you do so:
- You implicitly assume an order in the levels of party, since partyA is closer to partyB than to partyC.
- You implicitly assume that partyC is three times larger than partyA.
- The codification is completely arbitrary: why not consider 1, 1.5 and 1.75 instead?
The right way of dealing with categorical variables in regression is to set the variable as a factor and let R do the dummification internally.
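For instance, a minimal sketch of the correct handling of party (the data is made up):
# Hypothetical party variable wrongly codified as the numbers 1, 2 and 3
party_num <- c(1, 3, 2, 1, 2, 3)
# Convert it to a factor so that R dummifies it internally in regressions
party <- factor(party_num, labels = c("partyA", "partyB", "partyC"))
levels(party)  # partyA is the reference level by default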