Chapter 8 Linear regression, model selection, and prediction in R

by Heather Kropp for ENVST 325: Introduction to Environmental Data Science Hamilton College

8.1 Learning objectives

Conduct multiple linear regression in R
Transform variables for linear analysis
Consider model selection
Make predictions from a regression model
Analyze greenhouse gas emissions data from anthropogenic driven sources

8.2 New functions & syntax

log,unique,lm, summary, qqnorm, qqline, rstandard, fitted.values, abline, chart.Correlation, ols_step_forward_aic, predict

8.3 The problem: greenhouse gas emssisions from anthropogenic land use

There are many areas around the world that rely on surface water in rivers and lakes for a drinking and irrigation source. Man-made reservoirs are a common water management approach that involves constructing a dam to slow the flow of water in a river so that a lake may form upstream of the dam. This lake formation can ensure a stable water source for drinking and irrigation and electricity. However, reservoirs involve inundating areas of land that would not have been covered in water without a dam. Many of these areas would previously be covered with vegetation with soils rich in organic matter. There is an increasing concern that reservoirs can be sources of greenhouse gases, especially methane. Methane is a highly potent greenhouse gas. Methane is produced by bacteria in soils saturated with water with little oxygen. The flooded vegetation and organic rich soils offer ample food sources for bacteria and the low oxygen water environment is expect to promote greater production of methane than if the land was left unflooded. A new study by Deemer et al. compiled measurements of greenhouse gases taken from reservoirs around the world to examine the potential for reservoirs to be a source of greenhouse gas emissions. In this exercise, you will examine how much methane global reservoirs produce and examine the characteristics that drive greater methane production.

A dam helps form Lake Vyrnwy, a man-made reservoir. Source: Ray Jones CC2

8.3.1 The data: greenhouse gases and reservoir characteristics

The data collected by Deemer is a synthesis data set. A synthesis aim to compile data collected from a range of different studies to draw conclusions over a broader spatial scale with greater replication than the individual studies alone. However, synthesis often have to merge different observation approaches and some data may be missing if the original study did not collect it.

The Deemer_GHG_Data.csv file contains the greenhouse gas and reservoir characterstic data from sites shown in the map above.

# read in greenhouse gas data from reservoirs
ghg <- read.csv("/cloud/project/activity05/Deemer_GHG_Data.csv")

           System Latitude Longitude Amazon    Region       Country Alpine
1 Acton Reservoir    39.57    -84.73     no Temperate United States       
2       Al-Wihdeh    32.73     35.87     no Temperate  Jordan/Syria       
3      Alamo Lake    34.25   -113.50     no Temperate United States       
4       Allatoona    34.14    -84.64     no Temperate United States       
5   Alouette Lake    49.32   -122.43     no Temperate        Canada       
6        Angerman    63.19     17.23     no    Boreal        Sweden       
  hydropower           Use   ch4     co2 age chlorophyll.a mean.depth
1         no    Recreation    NA   31.80  50         59.00        3.9
2    unknown                  NA 1445.45   3            NA         NA
3         no Flood Control    NA -199.22  35          3.58         NA
4        yes               22.12  446.38  63          4.40        9.2
5        yes                  NA -111.00  74            NA         NA
6        yes                  NA   30.80  70            NA         NA
  surface.area volume Residence.Time..days.    DIP precipitation runoff
1         2.50     NA                   292  21.64          1039    459
2         0.82   0.01                   247 138.71           349    124
3         4.22     NA                    NA   0.07           208      0
4        49.00   0.45                    36  43.84          1339    468
5        16.00     NA                    NA  40.14          1769   1198
6       854.00     NA                    NA   1.95           592    279
  airTemp
1   10.61
2   19.60
3   20.09
4   15.37
5    8.24
6    3.07

There are a few main columns of data to focus on here. ch4 describes the methane flux (\(mg CH_4-C\) \(m^{-2}\) \(day^{-1}\)) in carbon equivalents emitted from the reservoir. The co2 column gives the carbon dioxide flux (\(mg CO_2-C\) \(m^{-2}\) \(day^{-1}\)) in carbon equivalents. If a carbon dioxide flux is negative, that means that the lake takes in more carbon dioxide than it emits. There are explanatory variables that also can influence the release of greenhouse gases. The depth of the reservoir (mean.depth, meters) can affect the exchange of oxygen, water chemistry, and microclimate. The age of the reservoir (age, years) can influence the availability of carbon and oxygen newly flooded vegetation and soils for microbes. The chlorophyll A measurements indicate how much photosynthesizing organisms are living in the water (chlorophyll.a in \(\mu g L^{-1}\)). The surface area (surface.area in \(km^2\)) and volume (volume in \(km^3\)) can influence water conditions. The following are all variables modeled from global data that were used by the authors: dissolved inorganic phosphorous (DIP \(kg ~km^{-2}~ yr^{-1}\)) influences the growth of microbes and photosynthetic organisms. Precipitation influences the flow of new water into the reservoir (precipitation, \(mm~ yr^{-1}\)), runoff is the total amount of water from precipitation that flows into a drainage basin (runoff, \(mm~ yr^{-1}\)), and the air temperature influences the conditions for microbial and aquatic organisms as well as water chemistry (airTemp, \(^\circ C\)).

You will also find it helpful to install and load the following packages:

library(dplyr)
library(ggplot2)
library(olsrr)
library(PerformanceAnalytics)

8.4 The approach: Linear regression in R

8.4.1 Transforming variables

Methane emissions around bodies of water are a type of observation that tends to have many observations at low values with occasional very high value. The histogram of the methane data below confirms this.

Keep in mind that the linear assumption for the ordinary least squares regression is: the parameters are linear, describing a linear relationship between two variables. This assumption simply gives the structure for \[y = \beta_0 + \beta_1 * x\]. However, either y or x does not have to be on a linear scale. In fact, y or x can be transformed to allow the data to be described on a linear scale. The most common type of transformation is a log-scale transformation. In R, the function log takes the natural log of data. The natural log often explains more relationships in environmental data than log base 10. Let’s transform the methane data before working with it. For the methane data, values of zero exist and there is no natural log of zero. However, including adding a value of one will shift all the data by a value, and any value that was zero on the linear scale will be zero on the log scale log(1) = 0.

# log transform methane fluxes
ghg$log.ch4 <- log(ghg$ch4+1)

You can see that the log scale results in a linear relationship between temperature and methane. Now the interpretation of the data in that linear relationship changes. As air temperature increases by one value, we are now looking at the increase in methane by one value on the log(ch4+1) scale.

Variables like total annual precipitation, age of the reservoir, and dissolved inorganic phosphorous would also be good values to transform due to their distribution and wide range of values.

ghg$log.age <- log(ghg$age)
ghg$log.DIP <- log(ghg$DIP+1)
ghg$log.precip <- log(ghg$precipitation)

8.4.2 Binary variables

It can be useful to control for categorical binary variables in a regression. A slope coefficient for these variables represents an addition or subtraction (if negative) to the intercept. This can account for an expectation for values of data being shifted higher or lower within this group category. There are a few variables where this might be helpful. Let’s use the unique function to look at all of the regions. Unique will print out each unique object in a vector with no duplicates.

unique(ghg$Region)

[1] "Temperate"   "Boreal"      "Subtropical" "Tropical"

ifelse will be helpful here to make a binary variable (just like making a flag in the last tutorial). It will be helpful to isolate two regions. Boreal regions are known for their methane producing wetlands with organic rich soils and vegetation adapted to flooding, and thus we might expect that the flooded regions and the vegetation around a reservoir may be more prone to producing methane. We will also isolate tropical regions. There are studies that indicate that tropical lakes emit more methane so we might also expect emissions to differ from subtropical and temperature regions.

# binary variable for boreal region
ghg$BorealV <- ifelse(ghg$Region == "Boreal",1,0)
# binary variable for tropical region
ghg$TropicalV <- ifelse(ghg$Region == "Tropical",1,0)

We will also want to create a variable for alpine regions (high elevation). Alpine regions are known for long term snow cover and glaciers that bring high amounts of cold, nutrient depleted melt water.

# binary variable for alpine region
ghg$AlpineV <- ifelse(ghg$Alpine == "yes",1,0)

Finally we will also add a variable that indicates that the reservoir is knowingly maintained for hydropower usage.

# binary variable for known hydropower
ghg$HydroV <- ifelse(ghg$hydropower == "yes",1,0)

8.4.3 Running a multiple regression

In R, the lm function allows you to specify a linear model and will automatically run an ordinary least squares regression. Here, we will focus on a multiple regression. The function can be easily altered to include only a single variable for a simple linear regression. The lm function expects a specific order of arguments where y ~ x1 + x2 where y is the dependent variable, and x is one or more independent variables. Let’s set up a multiple regression for our methane fluxes that looks at the main variables that I predict will have the greatest impact: air temperature, reservoir age, reservoir depth, log-scale dissolved inorganic phosphorous, precipitation, and presence in a boreal region.

# multiple regression
# creates a model object
mod.full <- lm(log.ch4 ~ airTemp+
                log.age+mean.depth+
                log.DIP+
                log.precip+ BorealV, data=ghg) #uses the data argument to specify dataframe

You can see the full model results using the summary function and referring to the model name. This will give you key information like the coefficients table and r-squared.

summary(mod.full)


Call:
lm(formula = log.ch4 ~ airTemp + log.age + mean.depth + log.DIP + 
    log.precip + BorealV, data = ghg)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1437 -0.6877 -0.0805  0.6668  4.1080 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.413344   2.245865   3.746 0.000337 ***
airTemp      0.136628   0.018754   7.285 2.00e-10 ***
log.age     -0.161456   0.125556  -1.286 0.202177    
mean.depth  -0.010156   0.004548  -2.233 0.028336 *  
log.DIP      0.431204   0.141994   3.037 0.003227 ** 
log.precip  -1.166775   0.339196  -3.440 0.000927 ***
BorealV      1.936664   0.460385   4.207 6.72e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.31 on 80 degrees of freedom
  (180 observations deleted due to missingness)
Multiple R-squared:  0.5082,    Adjusted R-squared:  0.4714 
F-statistic: 13.78 on 6 and 80 DF,  p-value: 1.091e-10

One thing to note is that almost half of the data was omitted due to a missing value in at least one of the columns included.

Assuming a standard confidence level of \(\alpha\)= 0.05, you can see that some variables are related to methane fluxes but others do not have the relationship that I predicted.

8.4.4 Checking assumptions

Before moving on to closely interpreting this regression, it is a good idea to check the main assumptions of the regression:

Relationships are linear
Residuals are normally distributed
Residuals have equal variance
Residuals are independent and random
In a multiple regression, independent variables are not highly correlated (multicollinearity)

It will be helpful to isolate standardized residuals (rstandard) and the fitted values from the regression line at each observation (fitted.values) for these tests.

res.full <- rstandard(mod.full)
fit.full <- fitted.values(mod.full)

8.4.4.1 1. Normality

You can use a qqplot to check normality of the residuals. For small data sets, the Shapiro-Wilks test (shapiro.test) can also be used to check normality.

# qq plot
qqnorm(res.full, pch=19, col="grey50")
qqline(res.full)

# shapiro-wilks test
shapiro.test(res.full)


    Shapiro-Wilk normality test

data:  res.full
W = 0.98232, p-value = 0.2817

You can see there are a few values that aren’t perfect but the data mainly follows the line. You can also see that the shapiro.test indicates the residuals do not significantly deviate from a normal distribution.

8.4.4.2 2-4. Residuals

Many of the issues outlined in assumptions 2-4 can be seen using a residual plot and background knowledge about the data. A residual plot simpily uses the fitted values and standard residuals. You can add a horizontal line to the plot at zero for easier comparison.

plot(fit.full,res.full, pch=19, col="grey50")
abline(h=0)

Here the residuals are not in major violation of any assumptions.

8.4.5 5. Multicollinearity

The assumption of multicollinearity applies to multiple regressions only. Reliable slope coefficients assume that each independent variable is not related to other independent variables in the model. You can check this easily using the chart.Correlation function in the PerformanceAnalytics package. You simply need a data frame of just the variables to compare.

# isolate continuous model variables into data frame:

reg.data <- data.frame(ghg$airTemp,
  ghg$log.age,ghg$mean.depth,
  ghg$log.DIP,
  ghg$log.precip)

# make a correlation matrix 
chart.Correlation(reg.data, histogram=TRUE, pch=19)

Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter
Warning in par(usr): argument 1 does not name a graphical parameter

A correlation coefficient varies from 0-1 with 1 indicated a perfect correlation. There is no established threshold for multicollinearity, but some studies indicate that correlation coefficients greater than 0.7-0.8 indicate that both variables are too correlated to include. The solution to this is to choose just one variable knowing that it may represent other underlying variables and processes (include this in your interpretation). The other solution is to create an index that represents both variables.

Here, there are a few variables that have a fair amount of correlation, but there is not a severe enough correlation to cause concern here.

8.5 Model selection

Model selection is one of the most challenging parts of designing an analysis. My selection of initial variables related to my biophysical understanding of microbial respiration, aquatic ecosystems, and microclimate. However, you can see that some variables are simply not significant and indicate no relationship. This outcome would be useful to report for inference. Understanding which variables influence methane emission and by how much provides more information about when and why reservoirs may be significant sources of methane. However, if I am interested in prediction, I will only want to include significant variables that greatly describe the variability in methane to reduce data collection and issues with over-fitting. Model selection is the process of eliminating variables and parameters that are not critical to the objective of the analysis and the outcomes of the model.

There are various ways of assessing model performance. The \(R^2\) gives information on the amount of variability in y explained by the model. The adjusted \(R^2\) penalizes the addition of extra independent variables. The Akaike information criterion (AIC) is a measure of model fit. The smaller the value, the better the model fit. Comparing these measures between candidate models, a series of possible models to compare. Model selection can help with statistically selecting variables that best describe the independent variable without adding unnecessary variables.

You can assess the importance of your variables in a multiple regression using stepwise selection. Stepwise selection iteratively adds (forward) or removes (backward) variables to calculate their impact on the model. Let’s look at forward stepwise selection using the function ols_step_forward_aic function from the olsrr package. The only argument needed is your full model.

# run stepwise
full.step <- ols_step_forward_aic(mod.full)
# view table
full.step


                             Stepwise Summary                              
-------------------------------------------------------------------------
Step    Variable        AIC        SBC       SBIC        R2       Adj. R2 
-------------------------------------------------------------------------
 0      Base Model    352.361    357.293    104.103    0.00000    0.00000 
 1      airTemp       331.606    339.003     83.302    0.23015    0.22109 
 2      BorealV       314.366    324.230     66.635    0.38289    0.36819 
 3      log.precip    310.582    322.911     63.085    0.42258    0.40171 
 4      log.DIP       305.395    320.190     58.559    0.46836    0.44243 
 5      mean.depth    302.392    319.653     56.274    0.49807    0.46709 
-------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.706       RMSE                 1.269 
R-Squared               0.498       MSE                  1.731 
Adj. R-Squared          0.467       Coef. Var           52.761 
Pred R-Squared          0.393       AIC                302.392 
MAE                     0.963       SBC                319.653 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    139.098         5         27.820    16.076    0.0000 
Residual      140.174        81          1.731                     
Total         279.271        86                                    
-------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)     7.841         2.210                  3.548    0.001     3.443    12.238 
    airTemp     0.142         0.018        0.816     7.679    0.000     0.105     0.178 
    BorealV     2.065         0.451        0.467     4.578    0.000     1.168     2.963 
 log.precip    -1.158         0.340       -0.353    -3.401    0.001    -1.835    -0.480 
    log.DIP     0.399         0.140        0.283     2.845    0.006     0.120     0.679 
 mean.depth    -0.010         0.005       -0.174    -2.190    0.031    -0.019    -0.001 
----------------------------------------------------------------------------------------

# check full model
full.step$model


Call:
lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
    data = l)

Coefficients:
(Intercept)      airTemp      BorealV   log.precip      log.DIP  
   7.840642     0.141549     2.065391    -1.157859     0.399330  
 mean.depth  
  -0.009994

# plot AIC over time
plot(full.step )

THe main thing that you want to interpret here is how much the AIC and adjusted \(R^2\) change with each added variable. You can see here that the log(DIP+1) had almost no influence on the model despite being statistically significant. This information is helpful for both inference and prediction. Adding the air temperature variable increased the adjusted \(R^2\) and decreased the AIC by the greatest magnitude.

8.6 Predictions

Regression lines can be used to make predict for missing data. In R, the predict.lm function can be used to predict the value from the regression model for given input data values. Here, you need to specify the input model, input independent variables, and the interval to return. confidence returns the 95% confidence interval whereas prediction returns the interval for a single point of data. The prediction variable accounts for the extra variability of each regression point around the line (think about that error term in the regression equation).

# prediction with interval for predicting a point
predict.lm(mod.full, data.frame(airTemp=20,log.age=log(2),
                             mean.depth=15,log.DIP=3,
                             log.precip=6, BorealV=0),
           interval="prediction")

       fit      lwr      upr
1 5.174622 2.347429 8.001815

# look at prediction with 95% confidence interval of the mean

predict.lm(mod.full, data.frame(airTemp=20,log.age=log(2),
                                mean.depth=15,log.DIP=3,
                                log.precip=6, BorealV=0),
           interval="confidence")

       fit      lwr     upr
1 5.174622 4.081764 6.26748

8.7 Conclusions

You can see that reservoirs are a significant source of methane in some cases, but some reservoirs produce more methane than others. It seems that areas with high air temperatures, lower amounts of annual precipitation, or in boreal regions are prone to higher levels of methane. With this information, we can understand the impact of reservoirs on greenhouse gas emissions across the globe or predict the carbon footprint of a proposed reservoir. You will practice implementing, designing,and selecting linear models further in class and homework.

8.8 Citations

Deemer, Bridget R., et al. “Greenhouse gas emissions from reservoir water surfaces: a new global synthesis.” BioScience 66.11 (2016): 949-964.