Classification Tree Analysis

Classification Trees analyse species occurrence by repeatedly splitting the dataset into mutually exclusive groups based on a threshold value of one of the explanatory variables.


Summary

Classification tree analysis repeatedly splits a dataset into subgroups based on a threshold value of one of the predictor variables. The tree is constructed by searching through the predictor variables to find the value of the one variable that best splits the dataset into two groups. At each split the dataset is partitioned into two mutually exclusive groups, each of which is as homogeneous as possible. This splitting process is repeated within each group until a stopping criterion is met, for example when the subgroups reach a minimum size or when no further improvement can be made.
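
A minimal sketch of this splitting process in R, using the rpart package that underlies the BCCVL implementation; the data frame, column names and values below are hypothetical:

    library(rpart)

    # Hypothetical presence/absence data: a 0/1 response and two
    # illustrative climate predictors.
    set.seed(1)
    occ <- data.frame(
      presence = rbinom(200, 1, 0.5),
      temp     = runif(200, 5, 30),
      precip   = runif(200, 200, 2000)
    )

    # Grow a classification tree: each split is a threshold on one predictor.
    fit <- rpart(factor(presence) ~ temp + precip, data = occ, method = "class")

    # The printed tree shows the root node, the internal splits and the
    # terminal (leaf) nodes with their class composition.
    print(fit)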


Classification trees consist of three different types of nodes, connected by directed edges (branches):

  • Root node: no incoming edges; this represents the undivided dataset at the top of the tree
  • Internal nodes: exactly one incoming edge and two or more outgoing edges
  • Leaf nodes: exactly one incoming edge and no outgoing edges (also called terminal nodes)

Classification Tree Analysis consists of three steps:

  1. tree building or growing, by repeatedly splitting the data into subsets; 
  2. tree stopping, when predefined criteria are met. This can be when splitting is impossible because all remaining observations have the same predictor values, when a further split would leave fewer observations in a node than the required minimum, or when the maximum number of splits in the tree has been reached. 
  3. tree pruning or optimal tree selection, to avoid overfitting of the data and reduce the tree complexity by keeping only the most important splits.
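
A sketch of these three steps with rpart, reusing the hypothetical occ data frame from the summary above (the stopping thresholds shown are illustrative, not recommendations):

    library(rpart)

    # 1. Grow: deliberately over-grow the tree (cp = 0) while applying basic
    #    stopping rules (minimum node sizes, maximum depth, cross-validation).
    fit <- rpart(factor(presence) ~ temp + precip, data = occ, method = "class",
                 control = rpart.control(minsplit = 20, minbucket = 7,
                                         maxdepth = 10, cp = 0, xval = 10))

    # 2. Stop: growing halts automatically once nodes fall below the minimum
    #    sizes, the maximum depth is reached, or no further split is possible.

    # 3. Prune: cut the tree back to its most important splits with a chosen
    #    complexity parameter (see 'Complexity parameter' below).
    pruned <- prune(fit, cp = 0.01)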


Although classification trees provide a very useful tool to visualize the hierarchical effects of multiple environmental variables on species occurrence, they are often criticized for being unstable and having low prediction accuracy. This has led to the development of other methods that build upon classification trees, such as random forests and boosted regression trees. 


Advantages

  • Simple to understand and interpret.
  • Can handle both numerical and categorical data.
  • Identify hierarchical interactions between predictors.
  • Characterize threshold effects of predictors on species occurrence.
  • Robust to missing values and outliers.

Limitations

  • Less effective for linear or smooth species responses due to the stepwise approach.
  • Requires large datasets to detect patterns, especially with many predictors.
  • Very unstable: small changes in the data can change the tree considerably.

Assumptions

No formal distributional assumptions: classification trees are non-parametric and can therefore handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal.


Requires absence data?

Yes.


Configuration options

BCCVL uses the 'rpart' package, implemented in biomod2.
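
Depending on the biomod2 version, these settings are passed through to rpart via the modelling options; a sketch for releases that provide BIOMOD_ModelingOptions (the values shown are illustrative only, not recommendations):

    library(biomod2)

    # Illustrative CTA (classification tree) options; biomod2 forwards these
    # to rpart()/rpart.control().
    cta_opts <- BIOMOD_ModelingOptions(
      CTA = list(
        method  = "class",                    # 'Method'
        control = list(xval      = 10,        # 'Cross-validations'
                       minbucket = 5,         # 'Minimum bucket'
                       minsplit  = 15,        # 'Minimum split'
                       cp        = 0.001,     # 'Complexity parameter'
                       maxdepth  = 25)        # 'Maximum depth'
      )
    )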


Weights:

allows you to give more or less weight to particular observations. If this option is left at NULL (the default), each observation (presence or absence) has the same weight, independent of the number of presences and absences. If the value is 0.5, absences are weighted equally to presences (i.e. the weighted sum of presences equals the weighted sum of absences). If the value is set below or above 0.5, absences or presences are given more weight, respectively.
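
One way to construct such weights (a purely illustrative helper; biomod2 computes its weights internally): presences keep a weight of 1 and absences are rescaled so that the weighted share of presences matches the requested value.

    library(rpart)

    # Hypothetical helper: weights for a 0/1 response so that the weighted
    # prevalence of presences equals 'prevalence' (0.5 = presences and
    # absences carry equal total weight).
    make_weights <- function(presence, prevalence = 0.5) {
      n_pres <- sum(presence == 1)
      n_abs  <- sum(presence == 0)
      w_abs  <- n_pres * (1 - prevalence) / (prevalence * n_abs)
      ifelse(presence == 1, 1, w_abs)
    }

    w     <- make_weights(occ$presence, prevalence = 0.5)
    fit_w <- rpart(factor(presence) ~ temp + precip, data = occ,
                   weights = w, method = "class")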

Resampling:

the number of permutations used to estimate the importance of each variable. If this value is greater than 0, the algorithm produces an output file called 'variableImportance.Full.csv', in which high values mean that the predictor variable is important, whereas values close to 0 indicate no importance.
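
The idea behind these permutation-based importance scores can be sketched as follows (not the exact biomod2 code; reuses the hypothetical occ data and fitted tree from above):

    # Shuffle one predictor at a time, re-predict, and measure how much the
    # predictions change: 1 - correlation with the original predictions.
    perm_importance <- function(fit, data, predictors, n_perm = 10) {
      ref <- predict(fit, data, type = "prob")[, "1"]
      sapply(predictors, function(v) {
        mean(replicate(n_perm, {
          shuffled      <- data
          shuffled[[v]] <- sample(shuffled[[v]])
          1 - cor(ref, predict(fit, shuffled, type = "prob")[, "1"])
        }))
      })
    }

    # Values close to 0 = unimportant; larger values = more important.
    perm_importance(fit, occ, c("temp", "precip"))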

Method:

select "class" for a classification tree with categorical response data, or "anova" for a regression tree with continuous response data.
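
For example (reusing the hypothetical occ data; the abundance column is invented for illustration):

    # Classification tree for a categorical (presence/absence) response.
    fit_class <- rpart(factor(presence) ~ temp + precip, data = occ, method = "class")

    # Regression tree for a continuous response.
    occ$abundance <- runif(nrow(occ), 0, 100)
    fit_anova <- rpart(abundance ~ temp + precip, data = occ, method = "anova")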

Cross-validations:

the default number of cross-validations is 10, which means that the dataset is divided into 10 subsets, using 9 subsets as ‘learning samples’ to build trees and 1 subset as a ‘test sample’ to calculate error rates. This process is repeated so that each subset serves as the test sample once (10 times in total), and the error rates are averaged to estimate the error rate for the full dataset. These error rates can then be used to prune the tree using the standard ‘1-SE’ rule, which selects the smallest tree (fewest nodes) whose cross-validated error is within one standard error of the minimum cross-validated error. The corresponding cp value (see below) can then be specified to generate an optimally pruned tree. 
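
With rpart this corresponds to the xval setting; a short sketch, again on the hypothetical occ data:

    # Grow with 10-fold cross-validation and inspect the cross-validated
    # error (xerror) for each candidate tree size.
    fit_cv <- rpart(factor(presence) ~ temp + precip, data = occ, method = "class",
                    control = rpart.control(cp = 0, xval = 10))

    printcp(fit_cv)   # table of CP, nsplit, rel error, xerror, xstd
    plotcp(fit_cv)    # visual aid for applying the 1-SE rule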

Minimum bucket:

the minimum number of observations that must be present in any terminal (leaf) node.

Minimum split:

the minimum number of observations that must exist in a node for a split to be attempted. This is a good way to limit the growth of the tree: when a node contains too few observations, further splitting will result in overfitting.

The optimal minimum split depends on the number of observations and predictor variables in your dataset. With a limited number of observations you do not have the luxury of stopping early, or you will end up with no tree at all. With many observations you can stop early and still obtain a large enough decision tree. The more predictor variables in your dataset, the greater the chance of an accidental (spurious) relationship between one of the variables and the response, so with many variables you should stop splitting earlier.

NB. If only one of "minimum bucket" or "minimum split" is specified, the code will set "minsplit" to "minbucket * 3", or "minbucket" to "minsplit / 3".
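
This mirrors the behaviour of rpart.control, which can be checked directly:

    library(rpart)

    # If only one of the two is supplied, rpart derives the other.
    rpart.control(minbucket = 5)$minsplit    # 15  (minbucket * 3)
    rpart.control(minsplit = 30)$minbucket   # 10  (round(minsplit / 3))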

Complexity parameter:

the complexity measure is a combination of the size of a tree and the ability of the tree to separate the classes of the response variable (e.g. presence/absence of a species). If the next best split in growing a tree does not reduce the tree’s overall lack of fit by at least the amount specified as the ‘complexity parameter’, the algorithm will stop the growing process.

A value of ‘cp’ = 1 will always result in a tree with no splits, while a value of ‘cp’ = 0 will build the maximum possible tree, which can potentially be very complex. You can use the information from the full tree to select the best ‘cp’ value for your final model. As a rule of thumb, it is best to prune a classification tree using the ‘cp’ of the smallest tree whose cross-validated error (xerror) is within one standard error of the tree with the smallest xerror.
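
Continuing from the cross-validated fit above (fit_cv), this 1-SE cp can be read off the cp table and used for pruning; a sketch:

    # Pick the simplest tree whose cross-validated error is within one
    # standard error of the minimum, then prune with its cp value.
    cpt    <- fit_cv$cptable
    best   <- which.min(cpt[, "xerror"])
    thresh <- cpt[best, "xerror"] + cpt[best, "xstd"]
    cp_1se <- cpt[min(which(cpt[, "xerror"] <= thresh)), "CP"]

    pruned <- prune(fit_cv, cp = cp_1se)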

Maximum depth:

the maximum depth of the tree, i.e. the maximum number of splits along any path from the root node down to a leaf node. This limit stops further splitting of nodes once the specified depth has been reached while building the initial decision tree. 

