Classification

33 Overview of Classification and Regression Trees

Learning Objectives

To consider ways of classifying observations based on both response and explanatory variables.

To understand the difference between classification trees and regression trees.

Key Packages

require(tidyverse)

require(mvpart)

Introduction

Classification and regression trees have the same objective as cluster analysis – to classify observations into groups on the basis of responses – but differ from cluster analysis in that explanatory variables are also incorporated into the classification.  As a result, classification and regression trees are also known as constrained clustering or supervised clustering (Borcard et al. 2018).  Borcard et al. (2018) also note that classification and regression trees have a strong focus on prediction, whereas unconstrained ordination methods focus on explanatory ability.

Classification and regression trees also differ from cluster analysis in that they are divisive (whereas hierarchical cluster analysis is typically agglomerative).  In other words, they start with all sample units in one group and search for the best way to split that group into two; each resulting group is then split in turn.  This repeated splitting is why classification and regression trees are also known as recursive partitioning.
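To make the idea of a single split concrete, here is a conceptual sketch (not the actual algorithm used by mvpart, which also handles multiple explanatory variables, pruning, and cross-validation): for one continuous response and one explanatory variable, we search for the threshold that minimizes the summed within-group sums of squares.

best_split <- function(x, y) {
  candidates <- sort(unique(x))[-1]   # candidate thresholds; both groups stay non-empty
  sse <- sapply(candidates, function(cutoff) {
    left <- y[x < cutoff]
    right <- y[x >= cutoff]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  data.frame(threshold = candidates[which.min(sse)], SSE = min(sse))
}

# Example with simulated data; the best threshold should be near 4
set.seed(1)
x <- runif(30, min = 0, max = 9)
y <- ifelse(x < 4, 2, 8) + rnorm(30)
best_split(x, y)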

The basic idea of a hierarchical, tree-based model is familiar to most ecologists – a dichotomous taxonomic key is a simple example of one.  Classification and regression trees are techniques that have entered the ecological literature relatively recently.  Computationally, they can be thought of as amalgams of multiple regression, cluster analysis, discriminant analysis, and other techniques.

Classification trees refer to analyses that use categorical data for the response variable, while regression trees refer to analyses that use continuous data for the response variable.  Although the terminology distinguishes these types of data, note that the same function can be used to analyze both; it recognizes them by the class of the response variable.  For simplicity, I will refer to them simply as regression trees in these notes.  Note: These techniques are sometimes known as CART (Classification and Regression Trees), which is the proprietary name of a regression tree program (https://www.minitab.com/en-us/products/spm/).
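For example, the rpart() function (from the rpart package, which mvpart builds on) chooses the type of tree from the class of the response.  This illustration uses a built-in dataset rather than ecological data:

library(rpart)

reg_tree <- rpart(Sepal.Length ~ ., data = iris)   # numeric response -> regression tree
cls_tree <- rpart(Species ~ ., data = iris)        # factor response -> classification tree

reg_tree$method   # "anova"
cls_tree$method   # "class"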

De’ath (2002) notes that regression trees can be used to explore and describe the relationships between species and environmental data, and to classify (predict the group identity of) new observations.  More generally, regression trees seek to relate response variables to explanatory variables by locating groups of sample units with similar responses in a space defined by the explanatory variables.  Unique aspects of regression trees are that the ecological space can be nonlinear and that they can easily include interactions between environmental variables.  The environmental variables can also be of any type (categorical, ordinal, continuous); they will be treated appropriately based on their class.

The original key citation about regression trees is Breiman et al. (1984).  Helpful references about regression trees are De’ath (2002), De’ath & Fabricius (2000), Vayssières et al. (2000), Venables & Ripley (2002, ch. 9), and Everitt & Hothorn (2006, ch. 8).

McCune & Grace (2002, ch. 29) give a few ecological applications of regression trees.  Vayssières et al. (2000) used univariate regression trees to predict the distributions of three major Quercus species in California.  Chase & Rothley (2007) used regression trees to predict, on the basis of 10 biogeoclimatic and positional variables, sites where grasslands and heathlands do not presently occur but could be established.  Smith et al. (2019) used multivariate regression trees to model water resource systems and compare the tradeoffs associated with planning decisions.

Regression trees can be conducted with both univariate and multivariate data (De’ath 2002).  We will use univariate regression trees to explore the basic concepts and then extend those concepts to multivariate regression trees.

Key Takeaways

Classification and regression trees can describe the relationships among existing variables and predict the group identity of new observations.  They do so through a divisive process, identifying the value of an explanatory variable that best separates a group of responses into two sub-groups or branches.  This process is hierarchical: early splits can have cascading consequences throughout the rest of the tree.

Spider Data (and Installing mvpart)

For this example, we’ll begin by analyzing the relationship between the abundance of a hunting spider, Trochosa terricola, and six environmental variables.  The data file is included in the mvpart package.

Trochosa terricola, a hunting spider.

 

The mvpart package is not actively maintained and therefore is no longer available through the standard installation routines.  Instead, we have to install it from a GitHub archive:

install.packages("devtools")

devtools::install_github("cran/mvpart")

Compiling this package may require that Rtools also be installed.  This installation should happen automatically but takes a few minutes; afterwards, you may have to re-run the command to install mvpart.  If Rtools does not install automatically, you may have to install it manually from outside of R.

Once the package has been compiled and installed, it can be loaded.  We’ll load the tidyverse at the same time.

library(mvpart)

library(tidyverse)

 

Now, we can load the spider data using the data() function:

data(spider)

(Note: if you are unable to load mvpart, you can still obtain the spider data from the ‘data’ folder in the GitHub repository.  In that case, you can load it using spider <- read.csv("data/spider.csv", row.names = 1))

dim(spider)

[1] 28 18

The data frame contains 18 columns: 12 containing the abundances of various spider species and 6 containing environmental variables (water, sand, moss, reft [light], twigs, herbs).  Each variable is quantified on an ordinal scale from 0 to 9. 

We’re going to use the explanatory variables together as a group, so let’s create a new object containing just them:

variables <- c("water", "sand", "moss", "reft", "twigs", "herbs")

env <- spider %>% select(any_of(variables))
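As a quick check, env should contain all 28 sample units but only the six explanatory variables:

dim(env)     # 28  6

names(env)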

We’ll use these data to illustrate univariate regression trees and then extend this to multivariate regression trees.

Advantages and Disadvantages of Regression Trees

Advantages of regression trees include:

  • Robust: minimal model assumptions:
    • Do not need to assume normality
    • Species-environment relationships can be nonlinear
    • Don’t need to make simplifying assumptions about the data.  For example, parametric models assume there is a single dominant structure in the data, whereas regression trees work with data that might have multiple structures.
  • Can include continuous and discrete explanatory variables in the same model
  • Can identify interactions among explanatory variables.
  • No need to preselect variables to include in model; uses automatic stepwise variable selection.
  • Variables can be reused in different parts of the tree.  At each stage, the variable selected is the one “holding the most information for the part of the multivariate space it is currently working on” (Vayssières et al 2000, p. 683).
  • Relatively insensitive to outliers.
  • Unaffected by monotonic transformations of explanatory variables (only rank order matters for splitting; see the short check after this list)
  • Easily interpretable outputs.
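To illustrate the point about monotonic transformations, here is a hedged check using rpart() on the Trochosa terricola abundance (assuming the spider data are loaded and the species column is named troc.terr): a log-transformed copy of an explanatory variable yields the same grouping of sample units.

library(rpart)

fit_raw <- rpart(troc.terr ~ water, data = spider)
fit_log <- rpart(troc.terr ~ log(water + 1), data = spider)

identical(fit_raw$where, fit_log$where)   # TRUE: same partition of sample units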

Disadvantages of regression trees include:

  • Represent continuous variables by breaking them into subsets and assuming the same average value for all observations within a subset.  In other words, an intercept-only model is fit to each subset.  Therefore, regression trees may mask linear relationships within the data.
  • Later splits are based on fewer cases than the initial ones.
  • Splits at the top of the tree are more important than those that occur near the leaves.
  • Generally require large sample sizes, depending on how many observations are required in each leaf (see minsplit and minbucket arguments below).
  • Ability to conduct cross-validation is a function of sample size.
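As a hedged preview of the minsplit and minbucket arguments mentioned above (they come from rpart.control() in the rpart package; the values shown here are arbitrary):

library(rpart)

ctrl <- rpart.control(minsplit = 10,   # a node needs at least 10 observations to be considered for a split
                      minbucket = 5,   # every terminal node (leaf) must contain at least 5 observations
                      xval = 10)       # number of cross-validations

fit <- rpart(Sepal.Length ~ ., data = iris, control = ctrl)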

Boosted Trees, Forests, and Beyond

These notes provide a very simple introduction to the idea of classification and regression trees.  Many other R packages are available for regression trees, including:

  • VSURF – to help identify which variables to focus on (Genuer et al. 2015).
  • treeClust – proposes a way to use classification and regression trees to calculate dissimilarities among objects that can then be used to cluster the objects using PAM (Buttrey & Whitaker 2015).
  • partykit – non-parametric regression trees through the function ctree() (Hothorn et al. 2006).
  • IntegratedMRF – univariate and multivariate random forests.

Many variations and extensions are available, including boosted regression trees (De’ath 2007, Elith et al. 2008; Parisien & Moritz 2009), random forests (Cutler et al. 2007; Van Kane will discuss these), and cascade multivariate regression trees (Ouellette et al. 2012).  Essentially, these techniques involve conducting multiple regression trees and then averaging the results to yield a final ‘ensemble’ solution.  These are computationally intensive procedures, but appear to produce models that have stronger predictive capabilities than single regression trees.  Because they require the construction of large numbers of trees, they require large datasets.  They can be conducted using packages such as randomForest.
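As a minimal, hedged sketch (assuming the randomForest package is installed, the spider data are loaded, and the species column is named troc.terr), a random forest for the Trochosa terricola data might look like this:

library(randomForest)

set.seed(42)
rf <- randomForest(troc.terr ~ water + sand + moss + reft + twigs + herbs,
                   data = spider,
                   ntree = 500,        # number of trees in the ensemble
                   importance = TRUE)  # track variable importance

rf                # summary, including out-of-bag error
importance(rf)    # relative importance of each explanatory variable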

Conclusions

Classification and regression trees are very popular in some disciplines – particularly those such as remote sensing that have access to enormous datasets.  They are appealing because they are robust, can represent non-linear relationships, don’t require that you preselect the variables to include in a model, and are easily interpretable.

However, regression trees can be problematic if they are over-fit to a dataset.  Boosted regression trees, random forests, and other extensions attempt to overcome some of the issues with (mostly univariate) regression trees, though they require more computational power.

References

Borcard, D., F. Gillet, and P. Legendre. 2018. Numerical ecology with R. 2nd edition. Springer, New York, NY.

Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone. 1984. Classification and regression trees. Chapman & Hall, New York, NY.

Buttrey, S.E., and L.R. Whitaker. 2015. treeClust: an R package for tree-based clustering dissimilarities. The R Journal 7(2):227-236.

Cutler, D.R., T.C. Edwards, Jr., K.H. Beard, A. Cutler, K.T. Hess, J. Gibson, and J.J. Lawler. 2007. Random forests for classification in ecology. Ecology 88:2783-2792.

De’ath, G. 2002. Multivariate regression trees: a new technique for modeling species-environment relationships. Ecology 83:1105-1117.

De’ath, G., and K.E. Fabricius. 2000. Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81:3178-3192.

Elith, J., J.R. Leathwick, and T. Hastie. 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77:802-813.

Genuer, R., J-M. Poggi, and C. Tuleau-Malot. 2015. VSURF: an R package for variable selection using random forests. The R Journal 7(2):19-33.

Hothorn, T., K. Hornik, and A. Zeileis. 2006. Unbiased recursive partitioning: a conditional inference framework. Journal of Computational and Graphical Statistics 15:651-674.

Ouellette, M-H., P. Legendre, and D. Borcard. 2012. Cascade multivariate regression tree: a novel approach for modelling nested explanatory sets. Methods in Ecology and Evolution 3:234-244.

Parisien, M.-A., and M.A. Moritz. 2009. Environmental controls on the distribution of wildfire at multiple spatial scales. Ecological Monographs 79:127-154.
