3.4 Inference for model parameters
The assumptions introduced in the previous section allow us to specify the distribution of the random vector \(\hat{\boldsymbol{\beta}}\). The distribution is derived conditionally on the sample predictors \(\mathbf{X}_1,\ldots,\mathbf{X}_n\). In other words, we assume that the randomness of \(\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\) comes only from the error terms and not from the predictors. To denote this, we employ lowercase for the sample predictors, \(\mathbf{x}_1,\ldots,\mathbf{x}_n\).
3.4.1 Distributions of the fitted coefficients
The distribution of \(\hat{\boldsymbol{\beta}}\) is: \[\begin{align} \hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol{\beta},\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\right) \tag{3.6} \end{align}\] where \(\mathcal{N}_{m}\) is the \(m\)-dimensional normal, that is, the extension of the usual normal distribution to deal with \(m\) random variables24. The interpretation of (3.6) is not as easy as in the simple linear case. Here are some broad remarks:
Bias. The estimates are unbiased.
Variance. It depends on:
- Sample size \(n\). Hidden inside \(\mathbf{X}^T\mathbf{X}\). As \(n\) grows, the precision of the estimators increases.
- Error variance \(\sigma^2\). The larger \(\sigma^2\) is, the less precise \(\hat{\boldsymbol{\beta}}\) is.
- Predictor sparsity \((\mathbf{X}^T\mathbf{X})^{-1}\). The more spread out (sparse) the predictors are, hence the smaller \(|(\mathbf{X}^T\mathbf{X})^{-1}|\) is, the more precise \(\hat{\boldsymbol{\beta}}\) is; see the sketch after this list.
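The following illustrative simulation (not part of the wine analysis; the predictors and sample sizes are made up) shows the last two points at work: the diagonal of \((\mathbf{X}^T\mathbf{X})^{-1}\), which drives the variances of the \(\hat\beta_j\)'s up to the factor \(\sigma^2\), shrinks when \(n\) grows and when the predictors are more spread out.
# Illustrative simulation: the variances of the coefficient estimates are
# sigma^2 times the diagonal of (X^T X)^{-1}
set.seed(1)
for (n in c(25, 100, 400)) {
  X <- cbind(1, rnorm(n), rnorm(n))  # intercept plus two simulated predictors
  print(diag(solve(t(X) %*% X)))     # shrinks as n grows
}
X1 <- cbind(1, rnorm(100, sd = 1))   # less spread-out predictor
X2 <- cbind(1, rnorm(100, sd = 3))   # more spread-out predictor
diag(solve(t(X1) %*% X1))
diag(solve(t(X2) %*% X2))            # smaller entry for the slope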
The problem with (3.6) is that \(\sigma^2\) is unknown in practice, so we need to estimate it from the data. We do so by computing a rescaled sample variance of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\): \[\begin{align*} \hat\sigma^2=\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{n-k-1}. \end{align*}\] Note the \(n-k-1\) in the denominator: these are the degrees of freedom, the number of data points minus the number of already fitted parameters (\(k\) slopes and \(1\) intercept). As in simple linear regression, the mean of the residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) is zero.
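As a quick check, \(\hat\sigma^2\) can be computed directly from the residuals of an lm fit; this is a sketch assuming the wine dataset used below is already loaded (the model is the same one fitted later as mod).
# Sketch: hat sigma^2 from the residuals (wine dataset assumed loaded)
fit <- lm(Price ~ Age + WinterRain, data = wine)
sigma2hat <- sum(residuals(fit)^2) / df.residual(fit)  # df.residual(fit) is n - k - 1
sqrt(sigma2hat)       # the "Residual standard error" reported by summary(fit)
mean(residuals(fit))  # numerically zero, as stated above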
If we use the estimate \(\hat\sigma^2\) instead of \(\sigma^2\), we get more useful distributions, this time for the individual \(\beta_j\)’s: \[\begin{align} \frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim t_{n-k-1},\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=\hat\sigma^2v_j^2\tag{3.7} \end{align}\] where \(t_{n-k-1}\) represents the Student’s \(t\) distribution with \(n-k-1\) degrees of freedom and \[ v_j^2\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}^T\mathbf{X})^{-1}. \] The LHS of (3.7) is the \(t\)-statistic for \(\beta_j\), \(j=0,\ldots,k\). These statistics are employed for building confidence intervals and hypothesis tests.
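Continuing the previous sketch, the standard errors and \(t\)-statistics in (3.7) can be reproduced from the diagonal of \((\mathbf{X}^T\mathbf{X})^{-1}\):
# Sketch of (3.7): hat SE(hat beta_j)^2 = hat sigma^2 * v_j^2
X <- model.matrix(fit)         # design matrix, intercept column included
v2 <- diag(solve(t(X) %*% X))  # v_j^2, j = 0, ..., k
se <- sqrt(sigma2hat * v2)     # hat SE(hat beta_j)
cbind(se, summary(fit)$coefficients[, "Std. Error"])  # both columns coincide
coef(fit) / se                 # t-statistics for H_0: beta_j = 0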
3.4.2 Confidence intervals for the coefficients
Thanks to (3.7), we can build the \(100(1-\alpha)\%\) CI for the coefficient \(\beta_j\), \(j=0,\ldots,k\): \[\begin{align} \left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)t_{n-k-1;\alpha/2}\right)\tag{3.8} \end{align}\] where \(t_{n-k-1;\alpha/2}\) is the \(\alpha/2\)-upper quantile of the \(t_{n-k-1}\). Note that with \(k=1\) we have the same CI as in (2.8).
Let’s see how we can compute the CIs. We return to the wine dataset, so in case you do not have it loaded, you can download it here as an .RData file. We analyze the CIs for the coefficients of Price ~ Age + WinterRain.
# Fit model
mod <- lm(Price ~ Age + WinterRain, data = wine)
# Confidence intervals at 95%
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 4.746010626 7.220074676
## Age 0.007702664 0.064409106
## WinterRain -0.001030725 0.002593278
# Confidence intervals at other levels
confint(mod, level = 0.90)
## 5 % 95 %
## (Intercept) 4.9575969417 7.008488360
## Age 0.0125522989 0.059559471
## WinterRain -0.0007207941 0.002283347
confint(mod, level = 0.99)
## 0.5 % 99.5 %
## (Intercept) 4.306650310 7.659434991
## Age -0.002367633 0.074479403
## WinterRain -0.001674299 0.003236852
In this example, the 95% confidence interval for \(\beta_0\) is \((4.7460, 7.2201)\), for \(\beta_1\) is \((0.0077, 0.0644)\), and for \(\beta_2\) is \((-0.0010, 0.0026)\). Therefore, we can say with 95% confidence that the coefficient of WinterRain is not significant. But in Section 3.1.1 we saw that it was significant in the model Price ~ Age + AGST + HarvestRain + WinterRain! How is this possible? The answer is that the presence of extra predictors affects the coefficient estimate, as we saw in Figure 3.6. Therefore, the precise statement to make is: in the model Price ~ Age + WinterRain, with \(\alpha=0.05\), the coefficient of WinterRain is not significant. Note that this does not mean that it will always be non-significant: in Price ~ Age + AGST + HarvestRain + WinterRain it is significant.
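As a sanity check, the intervals returned by confint can be reproduced by hand from (3.8); this is a sketch reusing the mod object fitted above.
# Sketch: reproduce confint(mod) from (3.8)
alpha <- 0.05
se <- sqrt(diag(vcov(mod)))                      # hat SE(hat beta_j)
tq <- qt(1 - alpha / 2, df = df.residual(mod))   # t_{n-k-1; alpha/2}
cbind(coef(mod) - tq * se, coef(mod) + tq * se)  # matches confint(mod)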
Compute and interpret the CIs for the coefficients, at levels \(\alpha=0.10,0.05,0.01\), for the following regressions:
- medv ~ . - lstat - chas - zn - crim (Boston)
- nox ~ chas + zn + indus + lstat + dis + rad (Boston)
- Price ~ WinterRain + HarvestRain + AGST (wine)
- AGST ~ Year + FrancePop (wine)
3.4.3 Testing on the coefficients
The distributions in (3.7) also allow us to conduct a formal hypothesis test on the coefficients \(\beta_j\), \(j=0,\ldots,k\). For example, the test for significance is especially important: \[\begin{align*} H_0:\beta_j=0 \end{align*}\] for \(j=0,\ldots,k\). The test of \(H_0:\beta_j=0\) with \(1\leq j\leq k\) is especially interesting, since it allows us to answer whether the variable \(X_j\) has a significant linear effect on \(Y\). The statistic used for testing for significance is the \(t\)-statistic \[\begin{align*} \frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)}, \end{align*}\] which is distributed as a \(t_{n-k-1}\) under (the veracity of) the null hypothesis. \(H_0\) is tested against the bilateral alternative hypothesis \(H_1:\beta_j\neq 0\).
Remember two important insights regarding hypothesis testing.
In a hypothesis test, the \(p\)-value measures how compatible the data are with \(H_0\): the smaller the \(p\)-value, the stronger the evidence against \(H_0\). The rule of thumb is the following:
Is the \(p\)-value lower than \(\alpha\)?
- Yes \(\rightarrow\) reject \(H_0\).
- No \(\rightarrow\) do not reject \(H_0\).
The connection between a \(t\)-test for \(H_0:\beta_j=0\) and the CI for \(\beta_j\), both at level \(\alpha\), is the following (see the numerical check after the rule).
Is \(0\) inside the CI for \(\beta_j\)?
- Yes \(\leftrightarrow\) do not reject \(H_0\).
- No \(\leftrightarrow\) reject \(H_0\).
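This duality can be verified numerically. The following sketch, which reuses the mod object from above, compares, for each coefficient, whether \(0\) falls inside the 95% CI with whether the \(p\)-value exceeds \(\alpha=0.05\).
# Sketch: CI-test duality for mod at alpha = 0.05
alpha <- 0.05
ci <- confint(mod, level = 1 - alpha)
pvals <- summary(mod)$coefficients[, "Pr(>|t|)"]
cbind(zero_in_CI = ci[, 1] < 0 & 0 < ci[, 2],  # 1: do not reject H_0
      p_above_alpha = pvals > alpha)           # both columns agree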
The tests for significance are built into the summary function, as we saw in Section 3. For mod, the regression of Price ~ Age + WinterRain, we have:
summary(mod)
##
## Call:
## lm(formula = Price ~ Age + WinterRain, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.88964 -0.51421 -0.00066 0.43103 1.06897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.9830427 0.5993667 9.982 5.09e-10 ***
## Age 0.0360559 0.0137377 2.625 0.0149 *
## WinterRain 0.0007813 0.0008780 0.890 0.3824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5769 on 24 degrees of freedom
## Multiple R-squared: 0.2371, Adjusted R-squared: 0.1736
## F-statistic: 3.73 on 2 and 24 DF, p-value: 0.03884
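The "t value" and "Pr(>|t|)" columns of the summary can be reproduced by hand; this is a sketch using the mod object and the \(t_{n-k-1}\) distribution from (3.7).
# Sketch: t-statistics and two-sided p-values behind summary(mod)
tstat <- coef(mod) / sqrt(diag(vcov(mod)))  # "t value" column
pval <- 2 * pt(abs(tstat), df = df.residual(mod), lower.tail = FALSE)  # "Pr(>|t|)" column
cbind(tstat, pval)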
The unilateral test \(H_0:\beta_j\geq 0\) (respectively, \(H_0:\beta_j\leq 0\)) vs \(H_1:\beta_j<0\) (\(H_1:\beta_j>0\)) can be done by means of the CI for \(\beta_j\). If \(H_0\) is rejected, we can conclude that \(\hat\beta_j\) is significantly negative (positive) and that, for the considered regression model, \(X_j\) has a significant negative (positive) effect on \(Y\). We have been carrying out these tests using the following rule of thumb (illustrated in the sketch after the rule):
Is the CI for \(\beta_j\) below (above) \(0\) at level \(\alpha\)?
- Yes \(\rightarrow\) reject \(H_0\) at level \(\alpha\). Conclude \(X_j\) has a significant negative (positive) effect on \(Y\) at level \(\alpha\).
- No \(\rightarrow\) the criterion is not conclusive.
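For instance (a purely illustrative sketch on the mod object above), the unilateral test \(H_0:\beta_{\mathrm{Age}}\leq 0\) vs \(H_1:\beta_{\mathrm{Age}}>0\) can be checked through the CI or, alternatively, through a one-sided \(p\)-value computed with pt.
# Sketch: unilateral test H_0: beta_Age <= 0 vs H_1: beta_Age > 0 at alpha = 0.05
confint(mod)["Age", ]  # CI entirely above 0 -> reject H_0
tAge <- coef(mod)["Age"] / sqrt(diag(vcov(mod)))["Age"]
pt(tAge, df = df.residual(mod), lower.tail = FALSE)  # one-sided p-value, below 0.05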
With \(m=1\), the density of a \(\mathcal{N}_{m}\) corresponds to a bell-shaped curve. With \(m=2\), the density is a surface similar to a bell.↩︎