Random Forests grow many decision trees on random subsets of the data and combine the predictions of these trees into a single, more robust prediction, while also providing estimates of the importance of each predictor variable.
Introduction
The Random Forest algorithm (Breiman 2001) produces a large number of decision trees (classification trees for categorical response data or regression trees for continuous response data), each grown on a random bootstrap sample of the data. The roughly one third of observations that are not in a given bootstrap sample (the 'out-of-bag' data) are used to evaluate the model. This procedure of fitting trees to bootstrap samples and combining their predictions is called 'bagging' (bootstrap aggregating). In addition, each split within a decision tree is made using a random subset of the predictor variables. The trees are grown to their maximum size without pruning, and the predictions of all trees are then combined (averaged for regression trees, majority vote for classification trees) to identify the set of predictor variables that produces the strongest model. The advantage of this approach is that, with a large number of trees, the error rate is much lower than that of a single-tree analysis.
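As a concrete illustration of this procedure, here is a minimal sketch using the 'randomForest' R package (the same package BCCVL calls through biomod2). The data frame, variable names and values are invented for illustration only; in practice the response would be your species presence/absence records and the predictors your environmental layers.

```r
# Minimal sketch of a Random Forest fit with the 'randomForest' R package.
# The data frame and variable names below are invented for illustration.
library(randomForest)

set.seed(42)
occ_data <- data.frame(
  occ   = factor(rbinom(500, 1, 0.4)),        # presence (1) / absence (0)
  temp  = rnorm(500, mean = 20, sd = 5),      # e.g. mean annual temperature
  prec  = rnorm(500, mean = 1200, sd = 300),  # e.g. annual precipitation
  slope = runif(500, 0, 30)                   # e.g. terrain slope
)

# Each tree is grown on a bootstrap sample of the rows ('bagging'); the roughly
# one third of rows left out of each sample (the out-of-bag data) provide an
# internal estimate of the error rate without a separate test set.
rf <- randomForest(occ ~ temp + prec + slope, data = occ_data,
                   ntree = 500, importance = TRUE)

print(rf)  # out-of-bag (OOB) error estimate and confusion matrix
plot(rf)   # error rate as a function of the number of trees
```

Because the response is a factor, the forest is made of classification trees and the combined prediction is the majority vote across trees, which is what predict() on the fitted object returns by default.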
Advantages
- One of the most accurate learning algorithms available.
- Can handle many predictor variables.
- Provides estimates of the importance of different predictor variables.
- Maintains accuracy even when a large proportion of the data is missing.
Disadvantages
- Can overfit datasets that are particularly noisy.
- For data that include categorical predictor variables with different numbers of levels, Random Forests are biased in favor of predictors with more levels. The variable importance scores from a Random Forest are therefore not always reliable for this type of data (see the sketch after this list).
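To illustrate the variable importance estimates mentioned above, and the bias described in the last point, the following sketch continues from the illustrative model fitted in the Introduction example. Note that the importance scores produced through BCCVL's 'Resampling' option are computed by biomod2's own permutation procedure, so the package-level measures shown here may differ from them.

```r
# Variable importance from the illustrative model fitted above (requires that
# the model was trained with importance = TRUE).
# type = 1: permutation importance (mean decrease in accuracy on out-of-bag
#           data); generally the more reliable measure.
# type = 2: mean decrease in Gini impurity; this measure tends to favor
#           predictors with many levels (or continuous predictors), which is
#           the bias described in the last point above.
importance(rf, type = 1)
importance(rf, type = 2)

varImpPlot(rf)  # plots both measures side by side
```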
Assumptions
No formal distributional assumptions (non-parametric).
Requires absence data?
Yes
Configuration options
BCCVL uses the 'randomForest' R package, as implemented in biomod2. The table below describes the options exposed in BCCVL; a minimal sketch of the corresponding randomForest() call follows the table.
Configuration option | Description
--- | ---
Weights | Allows you to give more or less weight to particular observations. If this option is kept at NULL (the default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. If the value is 0.5, absences are weighted equally to the presences (i.e. the weighted sum of presences equals the weighted sum of absences). If the value is set below or above 0.5, absences or presences, respectively, are given more weight.
Resampling | Number of permutations used to estimate the importance of each variable. If this value is >0, the algorithm produces an output file called 'variableImportance.Full.csv', in which high values mean that the predictor variable has high importance and values close to 0 mean it has no importance.
do.classif | Select this option to model classification trees (with binary data); deselect it to model regression trees (with continuous data).
ntree | The number of trees to grow. Random Forests stabilize at about 200 trees, so this number should not be too small (default = 500). A larger number of trees reduces the error rate and ensures that every input row gets sampled at least a few times, but increases computation time.
mtry | The number of predictor variables randomly sampled as candidates at each split.
nodesize | The minimum size (number of observations) of terminal nodes. The default is 1 for classification trees and 5 for regression trees. Setting this to a larger value causes smaller trees to be grown, so computation takes less time; however, for best accuracy use nodesize = 1.
maxnodes | The maximum number of terminal nodes that trees in the forest can have. If no value is specified, trees are grown to the maximum possible size or until the minimum terminal node size (set by nodesize) is reached.
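The configuration options above map closely onto arguments of the randomForest() function itself. The sketch below shows that mapping, using the illustrative occ_data from the Introduction example; the 'Weights' and 'Resampling' options are applied through the biomod2 wrapper rather than being randomForest() arguments, so they do not appear here.

```r
# How the BCCVL configuration options correspond to randomForest() arguments
# (using the illustrative 'occ_data' from the Introduction sketch).
rf_configured <- randomForest(
  occ ~ temp + prec + slope,   # a factor response gives classification trees
  data       = occ_data,       # ('do.classif'); a numeric response gives regression trees
  ntree      = 500,            # 'ntree': number of trees to grow (default 500)
  mtry       = 2,              # 'mtry': predictors tried as candidates at each split
  nodesize   = 1,              # 'nodesize': minimum size of terminal nodes
  maxnodes   = NULL,           # 'maxnodes': NULL grows trees as large as possible
  importance = TRUE            # keep permutation-based importance scores
)
print(rf_configured)
```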
References
Breiman L (2001) Random forests. Machine Learning, 45(1), 5-32.
Franklin J (2010) Mapping species distributions: spatial inference and prediction. Cambridge University Press.
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference and prediction. Springer.