Lecture 4 Generalized Regression

4.1 Overview

In this lecture, we introduce Generalized Linear Models (GLMs), an extension of linear regression designed for situations where the dependent variable follows any distribution from the exponential family, expanding beyond the traditional normality assumption. Using an example with a Poisson-distributed response variable, we illustrate how these models are constructed and how the linear predictor is connected to the outcome. Additionally, we explore the properties of the exponential family, including its parameters, moments, and some examples. Finally, we present the general model equation for GLMs.

4.2 Introduction: Allbus Dataset

In this chapter, we introduce the Allbus (Allgemeine Bevölkerungsumfrage der Sozialwissenschaften) dataset, a large-scale survey that has been conducted since 1990/1991. It provides valuable insights into the attitudes, behaviors, and demographics of the German population. You can download the data on the official website of the GESIS.

For this lecture, we use a smaller subset of the Allbus dataset, focusing on data from the years 2014 and 2018. The following variables are contained in the dataset:

Name	Description
respid	Unique identifier for each respondent
di07	Migration background of the respondent (0 = No, 1 = Yes)
age	Age of the respondent
sex	Gender of the respondent (1 = Male, 2 = Female)
iscd11	Highest level of education (ISCED 2011)
work	Employment status (1 = Employed, 2 = Unemployed, etc.)
eastwest	Region of residence (former East/West Germany) (1 = West, 2 = East)
german	German citizenship (0 = No, 1 = Yes)
feduc	Father’s highest level of education
meduc	Mother’s highest level of education
isei08	Socioeconomic index of occupational status
worktime	Weekly working hours
tv	Daily television consumption (in minutes)

In the dropdown you can find summary statistics for these variables:

Summary

allbus <- readRDS("./data/allbus_small.rds")
summary(allbus[,c(1:6)])
#>      respid            di07           sex     
#>  Min.   :   1.0   Min.   :  183   male  :847  
#>  1st Qu.: 899.5   1st Qu.: 1000   female:716  
#>  Median :1755.0   Median : 1400               
#>  Mean   :1768.3   Mean   : 1625               
#>  3rd Qu.:2648.0   3rd Qu.: 2000               
#>  Max.   :3476.0   Max.   :12600               
#>                                               
#>       age                                iscd11   
#>  Min.   :18.00   upper secondary            :600  
#>  1st Qu.:36.00   Masters or equiv           :414  
#>  Median :47.00   short-cycle tertiary       :221  
#>  Mean   :45.17   post secondary non-tertiary:146  
#>  3rd Qu.:55.00   Bachelor's or equiv        : 71  
#>  Max.   :78.00   lower secondary            : 60  
#>                  (Other)                    : 51  
#>            work     
#>  full-time   :1245  
#>  part-time   : 318  
#>  side job    :   0  
#>  not employed:   0  
#>                     
#>                     
#> 
summary(allbus[,c(7:13)])
#>  eastwest    german         feduc           meduc      
#>  west:1111   yes:1469   Min.   :1.000   Min.   :1.000  
#>  east: 452   no :  94   1st Qu.:2.000   1st Qu.:2.000  
#>                         Median :2.000   Median :2.000  
#>                         Mean   :2.855   Mean   :2.741  
#>                         3rd Qu.:3.000   3rd Qu.:3.000  
#>                         Max.   :6.000   Max.   :6.000  
#>      isei08         worktime           tv       
#>  Min.   :11.74   Min.   : 3.00   Min.   :  0.0  
#>  1st Qu.:30.78   1st Qu.:35.00   1st Qu.: 60.0  
#>  Median :54.55   Median :40.00   Median :120.0  
#>  Mean   :51.44   Mean   :38.88   Mean   :111.4  
#>  3rd Qu.:70.34   3rd Qu.:45.00   3rd Qu.:150.0  
#>  Max.   :88.70   Max.   :98.00   Max.   :600.0

4.3 Modeling Approach: Linear Model

4.3.1 Model

For the purpose of this lecture our variable of interest is the television consumption of the respondents. We transform the original variable tv so that the new variable, tv_hour, represents consumption in hours:

allbus$tv_hour <- round(allbus$tv/60)
hist(allbus$tv_hour, xlab="TV Hours per Day", main = "Histogram of TV Hours")

As independent variables we include worktime, sex, age, and loginc, which is the log of the income. The reason of the log-transformation lies in the negative skewness of the income. By taking the logarithm we obtain a more normally distributed covariate.

The equation of the linear model we want to estimate looks as follows:

\[ \text{tv_hour}_i = \beta_0 + \beta_1 \text{worktime}_i + \beta_2 \text{sex}_i + \beta_3 \text{age}_i + \beta_4 \text{loginc}_i + \varepsilon_i \]

4.3.2 Results

4.3.2.1 Regression

In the following you find the results after performing the regression in R:

allbus$loginc <- log(allbus$di07)
lm_tv <- lm(tv_hour ~ worktime + sex + age + loginc, data = allbus)
summary(lm_tv)
#> 
#> Call:
#> lm(formula = tv_hour ~ worktime + sex + age + loginc, data = allbus)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -2.3156 -0.7512  0.0553  0.4139  7.9919 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2.023719   0.399617   5.064 4.59e-07 ***
#> worktime    -0.001473   0.002814  -0.524   0.6007    
#> sexfemale    0.026092   0.060876   0.429   0.6683    
#> age          0.016262   0.002418   6.724 2.47e-11 ***
#> loginc      -0.117285   0.055773  -2.103   0.0356 *  
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.103 on 1558 degrees of freedom
#> Multiple R-squared:  0.03022,    Adjusted R-squared:  0.02773 
#> F-statistic: 12.14 on 4 and 1558 DF,  p-value: 1.016e-09

Examining the p-values, we find that the intercept, age, and loginc have statistically significant coefficients. Overall, the model explains approximately 3% of the variation in the response variable.

However, an analysis of the residuals reveals that the first and third quartiles are not symmetrically distributed around zero. This suggests that the residuals—and consequently, the response variable—do not follow a normal distribution. As a consequence, inference based on this model may not be reliable.

4.3.2.2 Prediction

We now plug the estimated model and different values for the covariates into the predict() command to make predictions for the daily hours of tv consumption. This function assumes a normally distributed response with the expectation coming from the linear model and the variance being constant: \[ Y \sim \mathcal{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2), \quad Y \in \mathbb{R} \]

predict(lm_tv, 
        newdata = list(worktime = 0, sex = "female", 
                       age = 56, loginc = 2))
#>        1 
#> 2.725931

predict(lm_tv, 
        newdata = list(worktime = 20, sex = "male", 
                       age = 26, loginc = 4))
#>        1 
#> 1.947933

predict(lm_tv, 
        newdata = list(worktime = 140, sex = "male", 
                       age = 16, loginc = 20))
#>          1 
#> -0.2680488

For 56 year old women with zero hours work time per week and a log-income of 2, the expected daily TV consumption is 2.73 hours. Further, for 26 year old men with 20 weekly working hours and a log income of 4 we expect a TV consumption of 1.95 hours per day. However, when we take an out of sample example that is 16 years old, we expect -0.27 hours of watching TV per day which is not a meaningful value.

This already gives us a first guess that our model fails when it comes to predictions. We further examine this by comparing the kernel density estimate of the empirical TV hours data to the predicted values from our linear model.

plot(density(allbus$tv_hour), main = "Density of TV Hours")
lines(density(predict(lm_tv)), col = "grey")

The real data distribution is multimodal with peaks every half hour while the predicted distribution is more concentrated with only one peak. It fails to capture the true variability, which highlights the mismatch between model and reality.

4.3.3 Problem of Linear Models

The problem of linear models is that the conditional distribution is fixed as a normal distribution with a fixed variance: \[Y_i \overset{iid}{\sim} N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} ,\sigma^2)\]

However, this assumption restricts the flexibility of the model, making it unsuitable for different types of variables or data distributions. To address this issue, the task is to adjust the conditional distribution to better fit the data.

4.4 Generalized Linear Models

4.4.1 Construction

For the construction of a generalized version of our linear regression model, we have three requirements. It should…

… capture the domain of \(\mathbf{y}|\mathbf{X}\).
… still keep the whole information of the covariates.
… still use the closed form distributions (to be able to do inference).

4.4.2 Idea

The general idea is to use a conditional distribution that fits the experiment and estimate \(Y|X \sim D(d(x))\), where \(d(x)\) is some response function of the covariates.

To summarize, the three steps you have to think about are:

Find a distribution that fits the response.
Choose a response function \(h\) for the expected value of the response \(E(y) = h(f(x))\) or a link function \(g\) such that \(g(E(y)) = f(x) = \mathbf{X}\boldsymbol{\beta}\).
Ensure that link and response functions are inversely related: \(g^{-1} = h\).

4.5 Modeling Approach: Generalized Linear Model

4.5.1 Distribution and Link Function

Going back to the Allbus example, we now think of a distribution that fits the data.

Since the data is discrete count data with values greater than or equal to zero we propose modeling it using the Poisson distribution. Thus, we need a response function that makes the linear regression equation positive: \(h(f(x)) > 0\). Potential choices for such a function include the absolute value, the square, or the exponential function, with the latter being the standard choice: \[\hat{y}_{Po} = \exp(\hat{\beta_0} + \hat{\beta_1}x_1+\dots)\] or in the case of our example \[\widehat{tv\_hour_i} = \exp(\hat{\beta_0} + \hat{\beta_1}worktime_i + \hat{\beta_2}sex_i + \hat{\beta_3}age_i + \hat{\beta_4}loginc_i)\]

4.5.2 Results

To estimate this model we use the glm() function in R with option family = "poisson":

glm_tv <- glm(tv_hour ~ worktime + sex + age + loginc, 
              family = "poisson", data = allbus)
summary(glm_tv)
#> 
#> Call:
#> glm(formula = tv_hour ~ worktime + sex + age + loginc, family = "poisson", 
#>     data = allbus)
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  0.7004645  0.2646700   2.647  0.00813 ** 
#> worktime    -0.0007592  0.0018572  -0.409  0.68270    
#> sexfemale    0.0138623  0.0405453   0.342  0.73243    
#> age          0.0088624  0.0016317   5.431 5.59e-08 ***
#> loginc      -0.0636080  0.0369873  -1.720  0.08548 .  
#> ---
#> Signif. codes:  
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for poisson family taken to be 1)
#> 
#>     Null deviance: 1239.9  on 1562  degrees of freedom
#> Residual deviance: 1207.9  on 1558  degrees of freedom
#> AIC: 4794.5
#> 
#> Number of Fisher Scoring iterations: 5

Due to the response functions, the parameters are interpreted differently in this model. Specifically, increasing the value of covariate \(x_j\) by one unit results in a multiplicative change of \(\exp{\hat{\beta}_j}\). For example, aging by one year increases the expected daily TV consumption by the factor \(\exp{(0.0088624) = 1.0089}\) ceteris paribus.

Derivation

We have a simple regression equation: \[ \hat{y} = \exp{(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2+\hat{\beta}_3x_3+\hat{\beta}_4x_4)} \] Increasing, for instance, the first covariate by one unit leads to the following estimation: \[ \begin{aligned} \hat{\tilde{y}} &= \exp{(\hat{\beta}_0+\hat{\beta}_1(x_1+1)+\hat{\beta}_2x_2+\hat{\beta}_3x_3+\hat{\beta}_4x_4)} \\ &= \exp{(\hat{\beta}_0+\hat{\beta}_1x_1+\hat{\beta}_2x_2+\hat{\beta}_3x_3+\hat{\beta}_4x_4)}*\exp{(\hat{\beta}_3)} \\ &= \hat{y}*exp{(\hat{\beta}_3)} \end{aligned} \]

4.6 Exponential Family

4.6.1 Probability Density Function

Generalized Linear Models, as implemented above, are possible for a certain class of distributions that belong to the (single-parameter) exponential family. A distribution of random variable \(Y\) belongs to the single-parameter exponential family (\(Y \sim EF\)) if the density can be written as \[ f_Y(y|\theta) = \exp \left(\frac{y\theta - b(\theta)}{\phi} \omega + c(y, \phi, \omega)\right) \]

Remarks:

The parameter \(\theta\) is called the canonical parameter.
The function \(b(\theta)\) has to be constructed such that \(f(y|\theta)\) can be normed and the derivations \(b'(\theta)\) and \(b''(\theta)\) exist.
The function \(c(y, \phi, \omega)\) does not rely on \(\theta\).
The parameter \(\phi\) is called the dispersion parameter.
\(\omega\) is known (usually some kind of weight).

4.6.2 Moments

It can be shown that \(\mathbb{E}[y] = b'(\theta)\) and \(\text{Var}[y] = \frac{\phi}{\omega}b''(\theta)\).

Derivation: Expected Value

First, taking the first derivative of the exponential family distribution function w.r.t. \(\theta\) yields: \[ \frac{d}{d\theta} f_Y(y|\theta) = f_Y(y|\theta) * \left(\frac{y - b'(\theta)}{\phi} \omega \right) \] Now, we can integrate out \(y\) on both sides of the equation. By following Leibniz’s rule, we obtain for the left-hand side: \[ \int \frac{d}{d\theta} f_Y(y|\theta) \hspace{0.25em}dy = \frac{d}{d\theta} \underbrace{\int f_Y(y|\theta)\hspace{0.25em}dy}_{=1} = 0 \] For the right-hand side: \[ \int f_Y(y|\theta) * \left(\frac{y - b'(\theta)}{\phi} \omega \right) dy \\ = \frac{\omega}{\phi} \int f_Y(y|\theta) * \left(y - b'(\theta)\right) dy \\ = \frac{\omega}{\phi} \left( \int f_Y(y|\theta) * y \enspace dy - \int f_Y(y|\theta) * b'(\theta) \hspace{0.25em} dy \right) \\ = \frac{\omega}{\phi} \left(\mathbb{E}[y] - b'(\theta ) \underbrace{\int f_Y(y|\theta) \hspace{0.25em}dy}_{=1} \right) \\ = \frac{\omega}{\phi} \left(\mathbb{E}[y] - b'(\theta )\right) \] Finally, we can find the value of the expectation by equating both sides of the equation: \[ 0 = \frac{\omega}{\phi} \left(\mathbb{E}[y] - b'(\theta )\right) \\ \Leftrightarrow \mathbb{E}[y] = b'(\theta) \quad \text{q.e.d.} \]

Derivation: Variance

The first step is to take the second derivative of the exponential family distribution function w.r.t. \(\theta\): \[ \begin{align} \frac{d^2}{d\theta^2} f_Y(y|\theta) &= \frac{d}{d\theta} \left[ f_Y(y|\theta) * \left(\frac{y - b'(\theta)}{\phi} \omega \right) \right] \\ &= f_Y(y|\theta) * \frac{- b''(\theta)}{\phi} \omega + f_Y(y|\theta) * \left(\frac{y - b'(\theta)}{\phi} \omega \right) * \left(\frac{y - b'(\theta)}{\phi} \omega \right) \\ &= f_Y(y|\theta) * \left[ \left(y - b'(\theta)\right)^2 \frac{\omega^2}{\phi^2} - \frac{b''(\theta)}{\phi} \omega \right] \end{align} \]

Now we integrate out \(y\) on both sides of the equation. Firstly, applying Leibniz’s rule lets us simplify the left-hand side as: \[ \int \frac{d^2}{d\theta^2} f_Y(y|\theta)\hspace{0.25em} dy = \frac{d^2}{d\theta^2} \underbrace{\int f_Y(y|\theta)}_{=1}\hspace{0.25em} dy = 0 \] Secondly, for the right-hand side we obtain the following result: \[ \int f_Y(y|\theta) * \left[ \left(y - b'(\theta)\right)^2 \frac{\omega^2}{\phi^2} - \frac{b''(\theta)}{\phi} \omega \right]\hspace{0.25em} dy \\ = \int f_Y(y|\theta) * \left(y - b'(\theta)\right)^2 \frac{\omega^2}{\phi^2} \hspace{0.25em} dy - \int f_Y(y|\theta) * \frac{b''(\theta)}{\phi} \omega \hspace{0.25em} dy \\ = \frac{\omega^2}{\phi^2} \int f_Y(y|\theta) * \left(y - b'(\theta)\right)^2 \hspace{0.25em} dy - \frac{b''(\theta)}{\phi} \omega \underbrace{\int f_Y(y|\theta)\hspace{0.25em} dy}_{=1} \\ = \frac{\omega^2}{\phi^2} \text{Var}[y] - \frac{b''(\theta)}{\phi} \omega \] Equating the results for both sides gives us: \[ 0 = \frac{\omega^2}{\phi^2} Var[y] - \frac{b''(\theta)}{\phi} \omega \\ \Leftrightarrow \text{Var}[y] = \frac{\phi}{\omega}b''(\theta) \quad \text{q.e.d.} \]

4.6.3 Examples

Some exemplary distributions that belong to the exponential family are the (Inverse) Gaussian, Exponential, (Inverse) Gamma, Beta, Bernoulli, (Negative) Binomial, and the Poisson distribution.

Example: Gaussian Distribution

The probability density function of the normal distribution is defined as: \[ f_Y(y \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} \right) \] We assume \(\sigma^2\) as constant and \(\mu\) as being the only unknown parameter. Now, putting everything into the exponential term yields: \[ f_Y(y \mid \mu) = \exp\left( -\frac{(y - \mu)^2}{2\sigma^2} + \log{\frac{1}{\sqrt{2\pi\sigma^2}}}\right) \]

Next, multiplying out the square in the numerator yields: \[ f_Y(y \mid \mu) = \exp\left( -\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2} + \log{\frac{1}{\sqrt{2\pi\sigma^2}}}\right) \] Rearranging and simplifying terms gives us: \[ f_Y(y \mid \mu) = \exp\left(\frac{y\mu - \frac{1}{2}\mu^2}{\sigma^2} - \frac{y^2}{2\sigma^2} + \log{\frac{1}{\sqrt{2\pi\sigma^2}}}\right) \] This equation is already sufficient to prove that the Gaussian distribution comes from the exponential family with parameters:

\(\theta = \mu\)
\(b(\theta) = \frac{1}{2}\mu^2\), \(b'(\mu) = \mu\), \(b''(\mu) = \sigma^2\)
\(\phi = \sigma^2\)
\(\omega = 1\)
\(c(y, \phi, \omega) = - \frac{y^2}{2\sigma^2} + \log{\frac{1}{\sqrt{2\pi\sigma^2}}}\)

Example: Poisson Distribution

The probability density function of the poisson distribution is defined as: \[ f_Y(y \mid \lambda) = \frac{\exp(-\lambda) \lambda^y}{y!} \] Putting everything into the exponential term and rearranging yields: \[ f_Y(y \mid \lambda) = \exp\left(y\log\lambda -\lambda - \log{y!}\right) \] This equation is already sufficient to prove that the Gaussian distribution comes from the exponential family with parameters:

\(\theta = \log\lambda\)
\(b(\theta) = \exp(\log\lambda) =\lambda\), \(b'(\theta) = \lambda\), \(b''(\theta) = \lambda\)
\(\phi = 1\)
\(\omega = 1\)
\(c(y, \phi, \omega) = - \log{y!}\)

4.7 General Definition

For all distributions in the single-parametric exponential family, we can define the general model class of Generalized Linear Models (GLM):

Distributional Assumption: \[y_i | \boldsymbol{x}_i\ \stackrel{iid}{\sim} \text{EF} \quad \text{with} \quad \mu_i = \mathbb{E}(y_i | \boldsymbol{x}_i)\]
Structural Assumption: The expectation \(\mu_i\) is linked to the linear predictor \(\eta_i\) by a (reversible and twice differential) response function \(h\): \[\mu_i = h(\eta_i) \quad \text{with} \quad \eta_i = \boldsymbol{x}_i^\top \boldsymbol{\beta} = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_k x_{ik}\] This renders possible a unified theory for parameter estimation and hypothesis testing and a unified estimation framework in statistical programming.

3 Linear Regression

5 Bayesian Inference

Chair of Spatial Data Science and Statistical Learning

Introduction to Bayesian Inference and Statistical Learning