Types of Cluster Analyses

Jonathan D. Bakker

Learning Objectives

To appreciate the range of types of cluster analyses available, based on whether they are hierarchical or not, agglomerative or divisive, polythetic or monothetic, and sequential or simultaneous.

Introduction

Cluster analysis seeks to classify observations into groups such that each observation is more similar to the other observations in its group than to observations in other groups. This is a descriptive process, not one in which hypotheses are tested. A cluster analysis can be applied to any dataset, whether or not the data contain underlying gradients or structure.

Cluster analysis is based entirely on the response variables; explanatory variables are not required. This is in contrast to classification and regression trees, which we will discuss separately.

In the types of cluster analysis considered here, the resulting clusters are assumed to be discrete – each sample unit is assigned to one (and only one) group. However, it’s worth remembering that data and communities often vary continuously.

On a related point, “not all problems are clustering problems. Before engaging in clustering, one should be able to justify why one believes that discontinuities exist in the data; or else, explain that one has a practical need to divide a continuous swarm of objects into groups” (Legendre & Legendre 2012, page 338).

Cluster analyses are subject to the same considerations that affect all other analyses, including:

What sampling strategy will you use (or was used to obtain your data)? Note that representative samples can be acceptable when conducting a cluster analysis; you do not need random samples.
What measurements will be made? Will you sample all species or a subset of them?
Will you standardize / transform data? If so, how?
Which distance measure will you use? Choose one that best preserves information from the system being studied.
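
As a minimal sketch of the last two considerations, the code below relativizes a small abundance matrix and then calculates two distance matrices from it. The data are hypothetical, and relativizing by species maxima and using Euclidean or Manhattan distances are illustrative assumptions rather than recommendations.

```r
# Hypothetical abundance data: 5 sample units (rows) x 3 species (columns)
abund <- matrix(c(10,  0, 3,
                   8,  1, 4,
                   0, 12, 1,
                   1,  9, 0,
                   5,  5, 5),
                nrow = 5, byrow = TRUE,
                dimnames = list(paste0("plot", 1:5), c("sp1", "sp2", "sp3")))

# Relativize each species by its maximum so every species ranges up to 1
abund_rel <- sweep(abund, 2, apply(abund, 2, max), "/")

# Two candidate distance measures calculated from the same relativized data
d_euc <- dist(abund_rel, method = "euclidean")
d_man <- dist(abund_rel, method = "manhattan")

round(d_euc, 2)
round(d_man, 2)
```

Either distance matrix could be passed to a clustering function; the choice of standardization and distance measure can change which sample units appear most similar.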

There are many types of cluster analyses and no single decision tree to easily distinguish them all. Instead, it’s more helpful to answer several questions.

Hierarchical or Non-hierarchical?

In a hierarchical analysis, groups are composed of subgroups. Thus, the structure at a given level is constrained by the structure at other levels in the dendrogram. A phylogenetic tree is a classic example of this type of structure: Pseudotsuga menziesii is within the genus Pseudotsuga, which is within the family Pinaceae, etc. While community-level data may not be hierarchical to this degree, this type of approach is quite common. When folk refer to a ‘cluster analysis’, they often mean a hierarchical approach, as illustrated by a dendrogram.
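
The nesting of groups can be checked directly. This sketch uses the built-in USArrests dataset purely for illustration; the standardization and the average linkage method are assumptions, not choices made in the text above.

```r
# Standardize the variables, then build a hierarchical classification
x <- scale(USArrests)
hc <- hclust(dist(x), method = "average")

# Cut the same dendrogram at two different levels
g2 <- cutree(hc, k = 2)
g4 <- cutree(hc, k = 4)

# Cross-tabulate: each of the 4 groups falls entirely within one of the 2 groups
table(four_groups = g4, two_groups = g2)
```

Calling plot(hc) draws the corresponding dendrogram.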

In a non-hierarchical analysis, observations in the same group at one ‘level’ of the analysis may or may not occur in the same group at another level. The goal is to optimize the structure at a given level irrespective of what the structure might be like at another level. A common example is k-means clustering. Since this is non-hierarchical, a dendrogram is not produced.
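
For comparison, here is a sketch of k-means clustering at two values of k, again using the built-in USArrests dataset as an illustrative assumption. Each solution is optimized on its own, so the cross-tabulation may or may not show the three groups nested within the two groups.

```r
set.seed(42)  # k-means uses random starting centroids
x <- scale(USArrests)

# nstart = 25 tries 25 random starts and keeps the best solution
km2 <- kmeans(x, centers = 2, nstart = 25)
km3 <- kmeans(x, centers = 3, nstart = 25)

# Compare group membership between the two solutions
table(k3 = km3$cluster, k2 = km2$cluster)
```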

Does It Build Groups Up or Tear Them Apart?

Hierarchical cluster analyses can move from simplicity to complexity or from complexity to simplicity. This is analogous to how a stepwise regression can be done forward or backward, though each clustering approach is a different type of technique. Like a stepwise regression, these approaches can yield different conclusions.

Agglomerative methods begin with each sample unit assigned to its own cluster and then iteratively fuse (combine) the two most similar clusters, continuing until there is just a single cluster. Distances between clusters are recalculated at each stage. In R, agglomerative clustering can be performed using the stats::hclust(), cluster::agnes(), and mclust::hc() functions, among others (the last of these is model-based). When folk refer to a ‘cluster analysis’, this is often what they mean.
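
A brief sketch of agglomerative clustering with two of these functions follows. The subset of the built-in USArrests dataset and the average linkage method are illustrative assumptions.

```r
# Ten sample units, standardized
x <- scale(USArrests[1:10, ])

hc <- hclust(dist(x), method = "average")

# One row per fusion: negative values are single sample units,
# positive values refer to clusters formed at that earlier row
hc$merge
hc$height  # distance at which each fusion occurred

# The same type of analysis via the cluster package
ag <- cluster::agnes(x, method = "average")
ag$ac      # agglomerative coefficient
```

With 10 sample units there are 9 fusions, the last of which joins everything into a single cluster.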

Divisive methods begin with a single cluster and then systematically divide the cluster into subgroups. TWINSPAN is a famous example of a hierarchical divisive method (Hill et al. 1975; McCune & Grace 2002, ch. 12), though it’s rarely used at present (Zelený 2015). In R, divisive clustering can be performed using the cluster::diana() and cluster::mona() functions. Classification and regression trees are also a divisive technique.
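
A sketch of divisive clustering with cluster::diana() is shown below, using the built-in USArrests dataset as an illustrative assumption. Converting the result with as.hclust() allows it to be cut into groups in the same way as an agglomerative result.

```r
x <- scale(USArrests)

# Divisive hierarchical clustering (Euclidean distances by default)
dv <- cluster::diana(x)
dv$dc  # divisive coefficient

# Convert to an hclust object and cut into 3 groups
groups <- cutree(as.hclust(dv), k = 3)
table(groups)
```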

Based On All Available Data, Or On One Variable At A Time?

Polythetic methods are multivariate: they use all variables at once to decide group membership, and are often based on a similarity or dissimilarity matrix.

Monothetic methods examine all variables to identify the single variable that will best separate groups. These techniques are often divisive rather than agglomerative.
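
The contrast can be sketched with a small presence/absence matrix. The data are hypothetical; cluster::mona() is used as one example of a monothetic divisive method (it requires binary variables), and hclust() on a binary (Jaccard-type) distance matrix stands in as a polythetic alternative.

```r
# Hypothetical presence/absence data: 8 sample units x 4 species
pa <- matrix(c(1, 1, 0, 0,
               1, 1, 0, 1,
               1, 0, 0, 0,
               1, 0, 1, 0,
               0, 1, 1, 1,
               0, 1, 0, 1,
               0, 0, 1, 1,
               0, 0, 1, 0),
             nrow = 8, byrow = TRUE,
             dimnames = list(paste0("plot", 1:8), paste0("sp", 1:4)))

# Monothetic: each division is based on a single variable
mo <- cluster::mona(pa)
mo$order     # sample units, ordered so that separated groups are adjacent
mo$variable  # variable used for each separation

# Polythetic: groups are based on distances calculated from all variables
poly <- hclust(dist(pa, method = "binary"))
cutree(poly, k = 2)
```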

What Is The Order Of Operations?

Sequential – solution is reached through a series of steps. For example, a hierarchical technique often will:

Identify and group the nearest two sample units.
Compare this group with the other sample units to identify the next most similar pair of objects.
Repeat this process until all sample units are included in a single group.
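
The steps above can be sketched as an explicit loop. This is a teaching sketch rather than how any package implements it: it assumes single linkage (the distance between two clusters is the distance between their nearest members) and uses a subset of the built-in USArrests dataset for illustration.

```r
x <- scale(USArrests[1:8, ])
dm <- as.matrix(dist(x))

clusters <- as.list(seq_len(nrow(dm)))  # start: each sample unit is its own cluster
fusion_heights <- numeric(0)

while (length(clusters) > 1) {
  best <- c(NA, NA)
  best_d <- Inf
  # Find the two closest clusters
  for (i in 1:(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      d_ij <- min(dm[clusters[[i]], clusters[[j]]])
      if (d_ij < best_d) {
        best_d <- d_ij
        best <- c(i, j)
      }
    }
  }
  # Fuse them and record the distance at which they were joined
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL
  fusion_heights <- c(fusion_heights, best_d)
}

fusion_heights
```

The recorded fusion distances can be compared with hclust(dist(x), method = "single")$height, which carries out the same sequence of fusions.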

Similarly, a non-hierarchical technique can start with random centroids for groups and then iteratively adjust them until some stopping criterion is met.
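
A minimal sketch of this iterative process is given below. It is a simplified k-means-style loop written for illustration, not the algorithm used by stats::kmeans() by default; the number of groups, the stopping tolerance, and the use of the built-in USArrests dataset are all assumptions.

```r
set.seed(1)
x <- scale(USArrests)
k <- 3

# Random starting centroids: k sample units chosen at random
centroids <- x[sample(nrow(x), k), , drop = FALSE]

for (iter in 1:100) {
  # Assign each sample unit to its nearest centroid
  d_to_centroids <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  groups <- apply(d_to_centroids, 1, which.min)

  # Recalculate each centroid as the mean of its assigned sample units
  new_centroids <- t(sapply(1:k, function(g) {
    colMeans(x[groups == g, , drop = FALSE])
  }))

  # Stopping criterion: the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-8)) break
  centroids <- new_centroids
}

iter           # number of iterations used
table(groups)  # final group sizes
```

Different random starts can converge to different solutions, which is why functions such as kmeans() offer an nstart argument.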

Simultaneous – solution is reached in a single step. Such methods are uncommon.

Conclusions

The broad category of ‘cluster analysis’ encompasses a large range of techniques. The most influential differences are between hierarchical and non-hierarchical techniques and between agglomerative and divisive techniques.

References

Hill, M.O., R.G.H. Bunce, and M.W. Shaw. 1975. Indicator species analysis, a divisive polythetic method of classification, and its application to a survey of native pinewoods in Scotland. Journal of Ecology 63:597-613.

Legendre, P., and L. Legendre. 2012. Numerical ecology. Third English edition. Elsevier, Amsterdam, The Netherlands.

McCune, B., and J.B. Grace. 2002. Analysis of ecological communities. MjM Software Design, Gleneden Beach, OR.

Zelený, D. 2015. TWINSPAN in R. https://davidzeleny.net/blog/2015/05/10/twinspan-in-r/

License

Applied Multivariate Statistics in R Copyright © 2024 by Jonathan D. Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.