Chapter 2 Methods
2.1 Coding Process
A large portion of this project was spent researching and learning the R code needed to run the necessary analyses. Several resources supported an initial understanding of R; however, further research was required to fit the more complex models toward the end of the project. First, for the simpler data management tasks, various resources, including previous linear regression projects, were consulted to complete tasks such as creating categorical variables for disturbances and splitting the starting time column into two columns so that the date could be removed. After the data management was complete, research was done on fitting the negative binomial regression models and on building the corresponding tables to display the results. UCLA’s negative binomial regression guide and example were used heavily in constructing all of the negative binomial regression models with the glm.nb function [6]. Once the function was better understood, all of the models were completed and then used to form the resulting tables.
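The two data management tasks named above can be illustrated with a minimal sketch. It is written in Python rather than the R used in the project, and it is not the project's actual code. The timestamp format, the function names, and the disturbance labels are all hypothetical, since the raw column layout and the coding scheme are not given here.

```python
from datetime import datetime

# Hypothetical label sets; the project's actual disturbance codes are not given.
HUMAN = {"boat", "jet ski", "fishing"}
NATURE = {"eagle", "osprey", "dolphin"}


def split_start_time(stamp, fmt="%m/%d/%Y %H:%M"):
    """Split a combined date-time string into separate (date, time) strings.

    The input format is an assumption about how the start time was recorded.
    """
    parsed = datetime.strptime(stamp, fmt)
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H:%M")


def disturbance_category(label):
    """Map a raw disturbance label to 'human', 'nature', or 'none'."""
    key = label.strip().lower()
    if key in HUMAN:
        return "human"
    if key in NATURE:
        return "nature"
    return "none"
```

Keeping only the second element returned by split_start_time corresponds to dropping the date from the start time column.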
2.2 Data Extraction Process
The research team provided a large dataset, Waterfowl Data, which included all birds and all of the observed variables. However, the models did not need all of the data in Waterfowl Data, so specific pieces of data had to be extracted and placed in smaller data sets. This extraction could only be done after the data management phase, because the categorical variables for human and nature disturbances, as well as location, first had to be created.
Once the data management was complete, two smaller data sets were created for the models, based on the research team’s desired predictors. The first was built by date Id and included the sums of the following variables: total birds, species of interest, disturbances (total, human, and nature), and the counts of each individual bird species. The average temperature for each date Id was also added to this data set. The second data set was created using the same process as the first; however, it also included the location variable. After these two data sets were created, the negative binomial regression models were fitted using the data in them.
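The aggregation described above (sums of counts and an average temperature per date Id, optionally also per location) can be sketched as follows. The sketch is in Python rather than the project's R, and it is not the project's code. The column names (date_id, location, birds, human_dist, nature_dist, temp) and the function name are hypothetical stand-ins for the actual Waterfowl Data columns.

```python
from collections import defaultdict
from statistics import mean


def summarise_by(rows, keys=("date_id",)):
    """Sum the count columns and average temperature within each group.

    rows is a list of dicts, one per observation; keys names the grouping
    columns. All column names here are hypothetical.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)

    summary = []
    for group_key, members in sorted(groups.items()):
        record = dict(zip(keys, group_key))
        for col in ("birds", "human_dist", "nature_dist"):
            record[col] = sum(m[col] for m in members)
        # Total disturbances are the human and nature disturbances combined.
        record["total_dist"] = record["human_dist"] + record["nature_dist"]
        record["avg_temp"] = mean(m["temp"] for m in members)
        summary.append(record)
    return summary
```

Calling summarise_by(rows) mirrors the first data set, and summarise_by(rows, keys=("date_id", "location")) mirrors the second, which adds the location variable.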
2.3 Variables
The research team wanted to examine the abundance of redhead, scaup, and bufflehead ducks as a function of the following variables: weather (temperature, water conditions, and wind speed), disturbances (human-induced vs. natural), area (Perdido Bay vs. Santa Rosa), and date of observation.
• Outcome Variables:
o Birds – The total number of birds that were observed by the research team.
o Species of Interest – The total number of redhead, scaup, and bufflehead that were observed.
o Redhead – A medium-sized North American diving duck. Males have a cinnamon-red head and gray body. Females have a pale face and an overall brown body.
o Bufflehead – A small sea duck with a large head that quickly vanishes and resurfaces when feeding. Males have a white and black body with hints of green and purple. Females have gray and brown bodies with white patches on their cheeks.
o Scaup – A medium-sized diving duck. Males have a gray back and white sides as well as a black head with a purplish iridescence. Females have brown bodies with white patches around their bill.
• Predictor Variables:
o Temperature – The temperature that was observed on each date that an observation took place.
o Water Conditions – The amount of choppiness in the water on the day of the observations, on a scale of 1 to 4, with 1 being calm and 4 being a rough chop.
o Wind Speed – The average wind speed on the day of the observation.
o Disturbances – Any factors that will disrupt the observations of waterfowl.
o Human Disturbances – Any factors caused by the presence of a human that would disrupt the observation of waterfowl.
o Nature Disturbances – Any factors not caused by the presence of a human that would disrupt the observation of waterfowl.
o Location – The location of the observation. Each location was categorized as either Perdido Bay or Santa Rosa.
o Date Id – The date the observation took place.
2.4 Statistical Methods
To decide how best to model the data and answer the research team’s questions, several regression models were considered. The first was a multiple linear regression model. This model is the basis of many other models, which is why it was the starting point. It would have met the requirement of estimating a linear relationship between the outcome variables and the predictor variables. However, the main problem with this model is that the residuals must be normally distributed with a mean of zero, which was not the case for our data.
The next model considered was Poisson regression. The Poisson distribution can be used to find the probability of a given number of events occurring in a certain time period, and the related Poisson process describes the waiting time until the next event occurs. This model initially looked promising because the exact timing of the bird observations is not known in advance, which is consistent with a Poisson model. However, the data turned out to be over-dispersed, so Poisson regression was not the best fit for our models.
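One quick, informal way to see over-dispersion is to compare the sample variance of a count variable with its mean, since a Poisson model assumes the two are equal. The sketch below is in Python rather than the project's R and is only an illustration of that comparison; it is not the diagnostic the project used, and the function name is hypothetical.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Return the sample variance divided by the mean of a list of counts.

    Values near 1 are consistent with a Poisson model; values well above 1
    suggest over-dispersion.
    """
    return variance(counts) / mean(counts)
```

A ratio well above 1 for the bird count sums would point away from Poisson regression and toward a model that allows the variance to exceed the mean.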
After finding that the data were over-dispersed, meaning that the variance is much higher than the mean, the best choice for building our models was negative binomial regression. Unlike the Poisson regression model, this type of regression does not assume that the variance is equal to the mean. Instead, the negative binomial model introduces a new parameter $\alpha$, which is used to express the variance in terms of the mean: $\text{variance} = \text{mean} + \alpha \cdot \text{mean}^{p}$. Negative binomial regression is typically used for over-dispersed count outcome variables, which matches the outcome variables in this analysis: all of them are sums of bird counts, either in total or for specific species. The over-dispersion may have come from the way the researchers counted the birds. When the number was over 100, the researchers estimated the count, and they did a block count for larger groups of ducks. These estimated counts could have caused the data to become over-dispersed, making the data a good candidate for negative binomial regression.
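The variance relationship above can be made concrete with a short numeric sketch, written in Python rather than the project's R. It assumes the common quadratic form with p = 2; the glm.nb function in R's MASS package uses this form and reports a parameter theta, which corresponds to 1/alpha. The function name and the example values are hypothetical.

```python
def nb_variance(mu, alpha, p=2):
    """Variance implied by a negative binomial model with mean mu.

    alpha is the dispersion parameter; alpha = 0 recovers the Poisson case
    in which the variance equals the mean. p = 2 is an assumed default.
    """
    return mu + alpha * mu ** p


# Illustrative values only: a mean count of 50 with alpha = 0.5.
poisson_var = nb_variance(50, 0)    # variance equals the mean
negbin_var = nb_variance(50, 0.5)   # variance grows with the squared mean
```

With these illustrative values the Poisson variance is 50 and the negative binomial variance is 1300, which shows how even a moderate alpha lets the model accommodate counts that vary far more than their mean.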