Generalised Linear Model : BCCVL (Sandpit)

Introduction

Generalized Linear Models (GLMs) are an extension of ‘simple’ linear regression models, which predict the response variable as a function of one or more predictor variables. Linear regression models assume that the response variable is normally distributed around its predicted mean, and that a constant change in a predictor leads to a constant change in the response variable. These assumptions are often violated in ecological data, for example presence/absence or count data, so linear models are extended into GLMs to handle non-normally distributed data.

GLMs find the equation that best predicts the occurrence of a species for the values of the environmental variables. The model has three important components: 1) the probability distribution of the response variable, 2) the linear predictor (LP), which is a linear combination of the predictor variables, and 3) the link function, which describes how the mean of the response depends on the linear predictor. Thus, the linear predictor is linear in the coefficients, but the relationship between the response and the predictors is generally not linear. The link function transforms the mean of the response so that the transformed mean is linearly related to the predictors.

A GLM with binomial data, such as presence/absence of a species, is commonly called “logistic regression”. In this case, the “link function” is the log of the odds (the logit), where the odds are the probability of presence divided by the probability of absence.
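The logit link and its inverse can be sketched in a few lines. This is an illustration only, written in Python rather than the R code that BCCVL runs; the intercept, slope and temperature values are hypothetical and chosen purely to show how a linear predictor is turned into a probability of presence.

```python
import math

def logit(p):
    """Log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a linear predictor value back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients for a model with a single predictor.
intercept, slope = -2.0, 0.3
temperature = 10.0

eta = intercept + slope * temperature  # linear predictor (log-odds scale)
p_presence = inv_logit(eta)            # predicted probability of presence
```

The linear predictor can take any real value, while the inverse link always returns a value between 0 and 1, which is why this link suits presence/absence data.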

The coefficient of a predictor variable (the number that is used to multiply a variable) in a logistic regression model can be interpreted as in the following hypothetical example. If a predictor, such as average annual temperature, has a positive coefficient of 0.3 in an estimated model of the occurrence of a species, this implies that a one-unit increase in temperature multiplies the odds of species presence by exp(0.3) = 1.35 (the odds ratio), i.e. the odds increase by 35%. The corresponding change in the probability of presence is not constant: it depends on the probability before the increase.
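To make the interpretation of a coefficient concrete, the sketch below converts the hypothetical coefficient of 0.3 into an odds ratio and then applies it to a few starting probabilities. The function name and the starting probabilities are made up for illustration.

```python
import math

coef = 0.3
odds_ratio = math.exp(coef)  # multiplicative change in the odds, about 1.35

def probability_after_one_unit(p_before, coef):
    """Probability of presence after a one-unit increase in the predictor."""
    odds_before = p_before / (1.0 - p_before)
    odds_after = odds_before * math.exp(coef)
    return odds_after / (1.0 + odds_after)

# The same coefficient shifts the probability by different amounts
# depending on where the starting probability sits.
for p in (0.1, 0.5, 0.9):
    print(p, round(probability_after_one_unit(p, coef), 3))
```

The odds change by the same factor in every case, but the change in probability is largest near 0.5 and shrinks as the starting probability approaches 0 or 1.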

The values of the variable coefficients are obtained by maximum likelihood estimation (MLE), which maximizes the "agreement" of the predicted species occurrences with the observed data. In other words, MLE finds the values of the coefficients that result in a model under which you would be most likely to get the observed results. Most GLM implementations, including the GLM provided in BCCVL, use the iteratively reweighted least squares (IWLS) method for MLE.
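A minimal sketch of IWLS for a logistic regression with an intercept and a single predictor is given below. It is not the code that R's glm or BCCVL runs: the function names and the small presence/absence dataset are hypothetical, and the stopping rule simply reuses the relative change in deviance described under the Epsilon option further down.

```python
import math

def deviance(x, y, b0, b1):
    """Binomial deviance: -2 times the log-likelihood for 0/1 data."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return -2.0 * total

def fit_logistic_iwls(x, y, epsilon=1e-8, max_iter=50):
    """Fit intercept b0 and slope b1 by repeated weighted least squares."""
    b0, b1 = 0.0, 0.0
    dev_old = float("inf")
    dev = dev_old
    for iteration in range(1, max_iter + 1):
        eta = [b0 + b1 * xi for xi in x]                 # linear predictor
        mu = [1.0 / (1.0 + math.exp(-e)) for e in eta]   # fitted probabilities
        w = [m * (1.0 - m) for m in mu]                  # working weights
        # Working response: linear predictor plus scaled residual.
        z = [e + (yi - m) / wi for e, yi, m, wi in zip(eta, y, mu, w)]

        # Weighted least squares of z on x (2 x 2 normal equations).
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b1 = (sw * swxz - swx * swz) / det
        b0 = (swz - b1 * swx) / sw

        # Stop when the relative change in deviance is below epsilon.
        dev = deviance(x, y, b0, b1)
        if abs(dev - dev_old) / (abs(dev) + 0.1) < epsilon:
            return b0, b1, dev, iteration
        dev_old = dev
    return b0, b1, dev, max_iter

# Hypothetical data: a predictor value and presence (1) / absence (0).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1, dev, n_iter = fit_logistic_iwls(x, y)
```

Each pass re-fits an ordinary weighted regression, with weights and a working response recomputed from the current fitted probabilities. The sketch does not guard against perfectly separated data, where fitted probabilities reach 0 or 1 and the weights vanish.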

Advantages

The response variable can follow any distribution from the exponential family (for example normal, binomial or Poisson).
Able to deal with categorical predictors.
Relatively easy to interpret, allowing a clear understanding of how each predictor influences the outcome.
Less susceptible to overfitting than, for example, the CTA or MARS algorithms.

Limitations

Needs relatively large datasets. The more predictor variables, the larger the sample size required. As a rule of thumb, the number of predictor variables should be less than N/10, where N is the sample size.
Sensitive to outliers.

Assumptions

No assumptions are made about the distributions of the environmental variables. However, they should not be highly correlated with one another because this could cause problems with the estimation.
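One common way to screen for highly correlated predictors before fitting is to compute pairwise Pearson correlations and flag pairs above a chosen cut-off. The sketch below is only an illustration: the predictor names and values are invented, and the threshold of 0.7 is an assumption, not a BCCVL setting.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def highly_correlated_pairs(predictors, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Hypothetical environmental values at six sites.
predictors = {
    "annual_mean_temp": [12.1, 14.3, 15.0, 17.2, 18.9, 20.4],
    "max_temp_warmest_month": [24.0, 26.5, 27.1, 29.8, 31.0, 33.2],
    "annual_precip": [800, 620, 910, 540, 760, 650],
}
flagged = highly_correlated_pairs(predictors)
```

When a pair is flagged, one option is to keep only the predictor that is more meaningful for the species being modelled.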

Requires absence data

Yes

Configuration options

BCCVL uses the ‘glm’ function in the ‘stats’ package, implemented in biomod2.

Configuration option	Description
Weights:	allows you to give more or less weight to particular observations. If this option is left as NULL (default), each observation (presence or absence) has the same weight, regardless of the number of presences and absences. If the value is 0.5, absences are weighted equally to presences (i.e. the weighted sum of presences equals the weighted sum of absences). If the value is set below 0.5, absences are given more weight; if it is set above 0.5, presences are given more weight.
Resampling:	number of permutations to estimate the importance of each variable. If this value is >0, the algorithm will produce an object called 'variableImportance.Full.csv', in which high values mean that the predictor variable has a high importance, whereas a value close to 0 corresponds to no importance.
Type:	the type of regression model to use. This can be either a linear (option ‘simple’), a quadratic or a polynomial model. The default of this algorithm is ‘quadratic’ which creates a curved model with one “hump”, a U or inverted U shape.
Interaction level:	the number of interactions between predictor variables that need to be considered.
Test:	the biomod2 package uses an automatic stepwise selection procedure, which means that the model is built by sequentially adding or dropping predictor variables and testing whether they improve the fit of the model. Predictors that do not improve the fit of the model will be dropped. Test indicates which criterion should be used to test the fit of the model: the Akaike Information Criterion (AIC), which is the default, or the Bayesian Information Criterion (BIC). By selecting ‘none’, the stepwise procedure will be switched off, resulting in a model in which all predictor variables are contained.
Family:	the description of the error distribution of the response variable and the link function used in the model. For binary data such as presence/absence of species, the binomial family is used (default in BCCVL).
Mustart:	starting values of the vector of means
Epsilon:	positive convergence tolerance epsilon; the iterations converge when |dev - devold|/(|dev| + 0.1) < epsilon.
Maximum MLE iterations:	the maximum number of IWLS iterations to find the maximum likelihood estimates.
MLE iteration output:	whether output should be produced for each IWLS iteration (yes/no). Default is ‘no’, but if ‘yes’ is selected, the Rout log file in the BCCVL Experiment results will show the fit of the model (deviance) for each iteration.
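The effect of the Weights option can be illustrated with a small sketch. This is an assumption about one way such weights could be derived, not the biomod2 implementation: presences keep a weight of 1 and absences are rescaled so that presences make up the requested share of the total weight. The function name and example data are hypothetical.

```python
def prevalence_weights(y, prevalence=0.5):
    """Per-observation weights so presences carry `prevalence` of total weight."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be between 0 and 1 (exclusive)")
    n_pres = sum(1 for v in y if v == 1)
    n_abs = len(y) - n_pres
    if n_pres == 0 or n_abs == 0:
        raise ValueError("need at least one presence and one absence")
    # Presences keep weight 1; absences are scaled up or down.
    abs_weight = (n_pres / n_abs) * (1.0 - prevalence) / prevalence
    return [1.0 if v == 1 else abs_weight for v in y]

# Hypothetical data: 2 presences and 6 absences.
y = [1, 1, 0, 0, 0, 0, 0, 0]
weights = prevalence_weights(y, prevalence=0.5)

presence_total = sum(w for w, v in zip(weights, y) if v == 1)
absence_total = sum(w for w, v in zip(weights, y) if v == 0)
```

With a value of 0.5 the two totals match, so the six absences together count as much as the two presences. Values above 0.5 shrink the absence weights further, which gives presences relatively more influence.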

References

Elith J, Graham CH, Anderson RP et al. (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129-151.
Franklin J (2010) Mapping species distributions: spatial inference and prediction. Cambridge University Press.
Guisan A, Edwards TC, Hastie T (2002) Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecological Modelling, 157(2), 89-100.
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. 2nd edition, Springer.
