
Foundational Concepts

12 Common Distance Measures

Learning Objectives

To consider various distance measures used with ecological data, and the types of data for which they are appropriate.

To consider whether shared absences matter.

To demonstrate how to deal with empty sample units.

To explore ways to identify patterns within compositional variation.

Resources

Legendre & De Cáceres (2013)

Key Packages

library(tidyverse)
library(vegan)
library(labdsv)
library(ecodist)
library(betapart)

Introduction

Distance measures are an essential component of many ecological analyses.  There are many to choose from; Legendre & De Cáceres (2013) compared 16 of them, and Legendre & Legendre (2012, Table 7.2) list 26 of them!  Borcard et al. (2018) devote an entire chapter to distance measures, using two primary criteria to organize their discussion:

  • What are the distributional characteristics of the data? For example, are the data binary (e.g., presence/absence) or continuously distributed (e.g., abundance)?
  • Are shared absences meaningful? A shared absence is a species (or other variable) that is absent from both sample units under consideration.  Data are 'symmetric' if a shared absence is meaningful, and 'asymmetric' if it is not.  Symmetry in this case means that two samples having the same zero value is as meaningful as those samples having another value the same (e.g., if the value of a variable was 1.5 for both samples).  Species composition data are a prime example of a situation where shared absences are not meaningful, as discussed in the 'Two Issues to Consider' section below.  Note that symmetry here is not the same thing as was discussed earlier for matrix algebra.

Combinations of these criteria are most appropriately handled using different types of distance measures.  A few distance measures are identified here; those discussed below are in bold.

Distribution | Species data (asymmetric) | Non-species data (symmetric)
Quantitative (continuous) | Bray-Curtis, Chi-Square, UniFrac | Euclidean, Manhattan
Mixed, including categorical | - | Gower
Binary | Jaccard, Sorensen, UniFrac | Simple matching

Note: the simple matching distance measure is not commonly used in ecology, and is not discussed here.

Euclidean Distance

The Pythagorean theorem is easily visualized in two dimensions, as we did in the last chapter.  It can also be applied to more dimensions, though it rapidly becomes difficult to visualize this.

The formula for the Euclidean Distance (ED) between samples i and h across p dimensions is:

[latex]ED = \sqrt{\sum_{j=1}^p(a_{hj} - a_{ij})^2}[/latex]

 

Here is a dataset reporting the presence or absence of each of five species (variables) on three plots:

Plot SppA SppB SppC SppD SppE
1 1 1 1 0 0
2 0 0 0 1 1
3 1 1 1 1 1

(Source: Legendre & Legendre 2012, p. 311)

 

What is the Euclidean distance (ED) between each pair of plots?

ED(1,2) = ________

ED(1,3) = ________

ED(2,3) = ________

Verify that the distance from a plot to itself is zero (property 1), and that the distance from plot 1 to plot 2 is the same as the distance from plot 2 to plot 1 (property 3).  Euclidean distances are positive (property 2), metric (property 4), and do not have an upper limit (property 5).
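One way to check these hand calculations is with the dist() function in base R. The sketch below enters plots 1-3 as a matrix (the object names are arbitrary choices, not part of the original example) and confirms the zero diagonal and the symmetry of the resulting distance matrix.

```r
# Presence/absence of five species on plots 1-3
plots <- matrix(
  c(1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    1, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("SppA", "SppB", "SppC", "SppD", "SppE")))

# Euclidean distances between all pairs of plots
ed <- dist(plots, method = "euclidean")
ed_full <- as.matrix(ed)   # square form, including the diagonal
print(round(ed_full, 3))

# The same calculation by hand for plots 1 and 2, following the formula
ed_12 <- sqrt(sum((plots["2", ] - plots["1", ])^2))

all(diag(ed_full) == 0)    # property 1: distance from a plot to itself is zero
isSymmetric(ed_full)       # property 3: ED(1,2) equals ED(2,1)
```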

 

Euclidean distances can be calculated using positive and negative values. One potential limitation of this distance measure is that the calculated distance depends on the scale of the variables.  For example, if variables are measured using very different scales (e.g., biomass in g for forbs, Mg for trees), the distances will be disproportionately affected by the variables measured using the larger scales (Legendre & Legendre 2012).  However, this can be addressed by relativizing the variables appropriately before calculating distances.

 

Euclidean distances are appropriate for many types of data, including geographic distances.

However, Euclidean distances are generally inappropriate for community data (e.g., a plot x species matrix containing the cover or presence/absence of multiple species).  Why?  One reason is that it’s possible for two samples with no species in common to have a smaller Euclidean distance than two samples that share species.  For example, compare the Euclidean distances among the following plots:

Plot SppA SppB SppC
4 0 4 8
5 0 1 1
6 1 0 0

(Source: modified from Legendre & Legendre 2012, Figure 7.8)

ED(4,5) = ________

ED(4,6) = ________

ED(5,6) = ________

Verify that plots 5 and 6 are more similar than plots 4 and 5.  Many ecologists find this unappealing because plots 5 and 6 have no species in common whereas plots 4 and 5 share the same species and differ only in abundance.  Instead, they would argue that the presence of the same species is more important than a difference in abundance of that species.
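The comparison can be checked in base R. This is a minimal sketch using the plot 4-6 values from the table above; the object names are arbitrary.

```r
# Abundance of three species on plots 4-6
comm <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC")))

ed <- as.matrix(dist(comm))   # Euclidean is the default method
print(round(ed, 3))

# Plots 5 and 6 share no species, yet are closer than plots 4 and 5
ed["5", "6"] < ed["4", "5"]
```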

Manhattan Distance

Euclidean distances are calculated by squaring the difference associated with each variable, but an alternative is to simply add the (absolute) differences.  This is analogous to summing the two perpendicular sides of a triangle rather than using the Pythagorean theorem to calculate the hypotenuse.

The formula for the Manhattan distance between samples i and h across p dimensions is:

[latex]MD = \sum_{j=1}^p \mid ( a_{hj} - a_{ij} ) \mid[/latex]

 

In the last chapter, we used the Euclidean distance to calculate the hypotenuse between two sample units based on the numbers of annual and perennial species.  The Manhattan distance between these sample units is the sum of the two perpendicular sides of the triangle.

This is also called the city-block distance ... can you see why?
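As a small illustration, the Manhattan distance can be computed directly from the formula or with dist(). The sketch reuses plots 4-6 from the Euclidean section; because it sums the sides rather than taking the hypotenuse, the Manhattan distance is never smaller than the Euclidean distance for the same pair.

```r
# Abundance of three species on plots 4-6
comm <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC")))

# By hand: sum of absolute differences between plots 4 and 5
md_45 <- sum(abs(comm["5", ] - comm["4", ]))

# For all pairs at once
md <- as.matrix(dist(comm, method = "manhattan"))
ed <- as.matrix(dist(comm, method = "euclidean"))
print(md)

# Manhattan distances are at least as large as Euclidean distances
all(md >= ed)
```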

Jaccard Similarity and Dissimilarity

Let’s consider presence/absence data some more.  For any two plots, species occurrences can be summarized in a contingency table:

Contingency table showing presence and absence in each of two sample units
When comparing two sample units, species can be present in both (a), absent in one but present in the other (b, c), or absent from both (d).

 

Note that this is not a data matrix.  Rather:

  • a is the number of species that are present in both plots
  • b is the number of species that are present in plot A but missing from plot B
  • c is the number of species that are missing from plot A but present in plot B
  • d is the number of species that are missing from both plots.

Jaccard (1912) proposed that we quantify the proportion of species that are present in both samples.  This is known as Jaccard similarity ([latex]S_{J}[/latex]):

[latex]S_{J} = \frac{a}{a + b + c}[/latex]

As a proportion, this value is bounded between 0 (no shared species) and 1 (all shared species).  This is a metric measure.

Note that species that are missing from both plots (d) are not included in this calculation; see the ‘Two Issues to Consider’ section below for more information on this.

 

Since Jaccard similarity has an upper bound of 1, it is easily converted to Jaccard dissimilarity ([latex]D_{J}[/latex]) by subtraction.  [latex]D_{J}[/latex] can also be calculated directly from the contingency table of species occurrences:

[latex]D_{J} = 1 - S_{J} = \frac{b + c}{a + b + c}[/latex]

Jaccard dissimilarity is the proportion of the species present in at least one of the two samples that are present in only one of them.

 

Refer back to plots 1-3 for which we calculated Euclidean distances.  What is the Jaccard dissimilarity between each pair of plots?

Plot Plot1 Plot2
Plot2 ________
Plot3 ________ ________
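To check your answers, the contingency-table counts can be tallied in base R. The helper function below (jaccard_dissim is a name invented for this sketch, not a function from any package) counts a, b, and c for two sample units and applies the formula for Jaccard dissimilarity given above.

```r
# Jaccard dissimilarity from the contingency-table counts
jaccard_dissim <- function(x, y) {
  x <- x > 0                 # convert to presence/absence
  y <- y > 0
  n_a <- sum(x & y)          # a: present in both
  n_b <- sum(x & !y)         # b: present in the first sample unit only
  n_c <- sum(!x & y)         # c: present in the second sample unit only
  (n_b + n_c) / (n_a + n_b + n_c)   # shared absences (d) are not used
}

plots <- matrix(
  c(1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    1, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("SppA", "SppB", "SppC", "SppD", "SppE")))

dj_12 <- jaccard_dissim(plots["1", ], plots["2", ])
dj_13 <- jaccard_dissim(plots["1", ], plots["3", ])
dj_23 <- jaccard_dissim(plots["2", ], plots["3", ])
print(c(dj_12, dj_13, dj_23))
```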

 

Note: Recent work has decomposed or partitioned Jaccard dissimilarities into two components, turnover and nestedness (Baselga 2010, 2012).  Turnover is species replacement (one species replaced with another), while nestedness is the extent to which the composition of one sample unit is a subset of the composition of another sample unit.  See the description of the betapart package below for more information.

Sorensen Similarity and Dissimilarity

Sorensen similarity ([latex]S_{S}[/latex]) is the proportion of species that are present in both samples, while accounting for differences in species richness between samples.  Using the same terminology as for Jaccard similarity, the formula is:

[latex]S_{S} = \frac{a}{\frac{(a + b) + (a + c)}{2}} = \frac{2a}{2a + b + c}[/latex]

Like Jaccard similarity, this value is bounded between 0 (no shared species) and 1 (all shared species).  Unlike Jaccard, however, it is semimetric.

Several people proposed this distance measure independently; the original publication by Sørensen is from 1948.

 

Since [latex]S_{S}[/latex] is a proportion, it can be converted to Sorensen dissimilarity ([latex]D_{S}[/latex]) by subtraction.  [latex]D_{S}[/latex] can also be calculated directly from the contingency table:

[latex]D_{S} = 1 - S_{S} = 1 - \frac{2a}{2a + b + c} = \frac{b + c}{2a + b + c}[/latex]

Sorensen dissimilarity is the number of species that are present in only one of the two samples, expressed as a proportion of the summed species richness of the two samples.

 

Refer back to plots 1-3.  What is the Sorensen dissimilarity between each pair of plots?

Plot Plot1 Plot2
Plot2 ________
Plot3 ________ ________

Notice that these data do not satisfy the triangle inequality: the dissimilarity from plot 1 to 3 plus the dissimilarity from plot 3 to plot 2 is less than the dissimilarity from plot 1 to 2.  This demonstrates that the Sorensen dissimilarity is a semimetric measure.
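The triangle inequality check can be reproduced in base R. In this sketch, sorensen_dissim is a helper name invented for illustration; it applies the contingency-table formula given above to plots 1-3.

```r
# Sorensen dissimilarity from the contingency-table counts
sorensen_dissim <- function(x, y) {
  x <- x > 0
  y <- y > 0
  n_a <- sum(x & y)          # a: present in both
  n_b <- sum(x & !y)         # b: present in the first sample unit only
  n_c <- sum(!x & y)         # c: present in the second sample unit only
  (n_b + n_c) / (2 * n_a + n_b + n_c)
}

plots <- matrix(
  c(1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    1, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("SppA", "SppB", "SppC", "SppD", "SppE")))

ds_12 <- sorensen_dissim(plots["1", ], plots["2", ])
ds_13 <- sorensen_dissim(plots["1", ], plots["3", ])
ds_23 <- sorensen_dissim(plots["2", ], plots["3", ])

# Triangle inequality would require ds_13 + ds_23 >= ds_12
ds_13 + ds_23 < ds_12      # TRUE here, so the inequality is violated
```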

 

Note: Recent work has decomposed or partitioned Sorensen dissimilarities into two components, turnover and nestedness (Baselga 2010, 2012).  Turnover is species replacement (one species replaced with another), while nestedness is the extent to which the composition of one sample unit is a subset of the composition of another sample unit.  See the description of the betapart package below for more information.

Bray-Curtis Distance

When the formula for Sorensen dissimilarity is extended from presence/absence data to species abundance data, it results in the Bray-Curtis distance measure:

[latex]D_{i,h} = \frac{\sum_{j=1}^p \mid a_{ij} - a_{hj} \mid} { \sum_{j=1}^p a_{ij} + \sum_{j=1}^p a_{hj} } = 1 - \frac{ 2 \sum_{j=1}^p MIN(a_{ij}, a_{hj}) }{ \sum_{j=1}^p a_{ij} + \sum_{j=1}^p a_{hj} } = 1 - \frac{ 2 \sum_{j=1}^p MIN(a_{ij}, a_{hj}) }{ a_{i\cdot} + a_{h\cdot} }[/latex]

where

  • [latex]p[/latex] is the total number of species
  • [latex]a_{ij}[/latex] is the abundance of species j in sample unit i
  • [latex]a_{hj}[/latex] is the abundance of species j in sample unit h
  • [latex]a_{i\cdot}[/latex] is the total abundance of all species in sample unit i
  • [latex]a_{h\cdot}[/latex] is the total abundance of all species in sample unit h

These formulae are from chapter 6 of McCune & Grace (2002).  The middle and right-hand versions are the same except that in the right-hand one I used the same terminology in the denominator as in the formula for the chi-square distance below.  This is to permit easier comparisons between the two measures.  Note that these formulae are based on the data matrix, not on the contingency table that was the basis of the Sorensen dissimilarity.
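As a quick check that the left-hand and middle versions of the formula agree, both can be written as one-line functions in base R. The function names are invented for this sketch; the abundances are those of plots 4 and 5 from the Euclidean section, and the final lines show that the same formula applied to the presence/absence data for plots 1 and 3 gives a value equal to the Sorensen formula (b + c) / (2a + b + c).

```r
# Left-hand version: summed absolute differences over summed totals
bray_curtis <- function(x, y) {
  sum(abs(x - y)) / (sum(x) + sum(y))
}

# Middle version: one minus twice the shared abundance over summed totals
bray_curtis_min <- function(x, y) {
  1 - 2 * sum(pmin(x, y)) / (sum(x) + sum(y))
}

p4 <- c(SppA = 0, SppB = 4, SppC = 8)
p5 <- c(SppA = 0, SppB = 1, SppC = 1)

bc_45     <- bray_curtis(p4, p5)
bc_45_min <- bray_curtis_min(p4, p5)
all.equal(bc_45, bc_45_min)

# Presence/absence data for plots 1 and 3 (a = 3, b = 0, c = 2)
p1 <- c(1, 1, 1, 0, 0)
p3 <- c(1, 1, 1, 1, 1)
bc_13 <- bray_curtis(p1, p3)
all.equal(bc_13, (0 + 2) / (2 * 3 + 0 + 2))
```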

 

The Bray-Curtis distance measure is bounded between 0 (the sample units are identical) and 1 (the sample units are completely different), and is semimetric.

 

The Bray-Curtis distance measure is named after the co-authors of the paper in which it was used (Bray & Curtis 1957).  However, and confusingly, it is also known by many other names: Steinhaus, Czekanowski, Sorensen, and percentage difference.  Often this is because the same measure was proposed independently or because two measures were proposed that were later shown to be mathematically equivalent.

 

Several studies (notably, Faith et al. 1987) have concluded that the Bray-Curtis distance measure functions best for community data (e.g., a plot x species matrix).  We will see it throughout this course.

Borcard et al. (2018) note that the Bray-Curtis distance “gives the same importance to absolute differences in abundance irrespective of the order of magnitude of the abundances … a difference of 5 individuals has the same weight when the abundances are 3 and 8 as when the abundances are 6203 and 6208” (p.39).  If this is problematic, the data can be log-transformed before computing distances.

Note: Recent work has decomposed or partitioned Bray-Curtis distances into two components, one related to ‘balanced variation in abundance’ and the other to ‘abundance gradients’ (Baselga 2013).  These components are analogous to the turnover and nestedness components of Jaccard dissimilarities.  See the description of the betapart package below for more information.

Chi-Square Distance

The Chi-square distance measure is the basis of an ordination technique known as correspondence analysis, which, with its variants, is popular in some quarters.  Simulation tests have found that chi-square distances do not perform well with community data (Faith et al. 1987), but the popularity of correspondence analysis means that it is helpful to be familiar with this measure.

The formula for chi-square distance is:

[latex]D_{i,h} = \sqrt {\sum_{j=1}^p \frac{1}{a_{\cdot j}} [ \frac{a_{hj}}{a_{h\cdot}} - \frac{a_{ij}}{a_{i\cdot}} ]^2}[/latex]

where

  • [latex]p[/latex] is the total number of species
  • [latex]a_{ij}[/latex] is the abundance of species j in sample unit i
  • [latex]a_{hj}[/latex] is the abundance of species j in sample unit h
  • [latex]a_{i\cdot}[/latex] is the total abundance of all species in sample unit i
  • [latex]a_{h\cdot}[/latex] is the total abundance of all species in sample unit h
  • [latex]a_{\cdot j}[/latex] is the total abundance of species j across all sample units

This formula is from chapter 6 of McCune & Grace (2002).

 

Like Euclidean distances, this measure involves summing squared differences.  However, the chi-square distance measure also:

  • Expresses the abundance of each species ([latex]a_{hj}[/latex] and [latex]a_{ij}[/latex]) as a proportion of the total abundance on the sample unit ([latex]a_{h\cdot}[/latex] and [latex]a_{i\cdot}[/latex]). In other words, relativization by row total is built into this distance measure.
  • Weights the squared difference between the sample units by the inverse of the total abundance of the species ([latex]a_{\cdot j}[/latex]).  This is also a built-in relativization, but it's not one of the common ones that we've discussed previously (e.g., relativization by column maximum).  Instead, this is relativization by column total.

This last aspect – weighting by the total abundance of the species – is where some of the problems arise.  It means, for example, that:

  • The distance between two sample units depends on which other sample units are included in the data matrix.
  • Since this is an inverse weighting, common species are downplayed and rare species are weighted more strongly.
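The first of these consequences can be demonstrated with a direct translation of the formula above into base R. This is a sketch only: chisq_dist is a name invented here, plot 11 is a hypothetical extra sample unit added for illustration, and packaged implementations may scale the result differently from this formula.

```r
# Chi-square distance between sample units i and h of matrix X,
# following the formula in the text
chisq_dist <- function(X, i, h) {
  X <- X[, colSums(X) > 0, drop = FALSE]   # drop species absent everywhere
  row_prop <- X / rowSums(X)               # a_ij / a_i. (relativize by row total)
  col_tot  <- colSums(X)                   # a_.j (total abundance of each species)
  sqrt(sum((row_prop[h, ] - row_prop[i, ])^2 / col_tot))
}

comm3 <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC")))

# Add a hypothetical fourth sample unit (values invented for illustration)
comm4 <- rbind(comm3, "11" = c(3, 0, 2))

d_without <- chisq_dist(comm3, "4", "5")
d_with    <- chisq_dist(comm4, "4", "5")

# Plots 4 and 5 are unchanged, but their distance is not
c(without_plot11 = d_without, with_plot11 = d_with)
```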

UniFrac Distance

In the above distances, each species is considered individually and the differences for the individual species are summed to determine the total distance between sample units.  Another approach is to account for the degree of relatedness among species.  For example, consider these three plots:

Plot Poa pratensis Poa compressa Hypochaeris radicata
7 5 0 0
8 0 5 0
9 0 0 5

By any of the above distance measures, the magnitude of the difference is the same between any two plots.  However, the species in plots 7 and 8 are from the same genus and thus arguably these plots are more similar to one another than to plot 9.

UniFrac incorporates phylogenetic information about the taxa present in the sample units.  A phylogenetic tree is constructed that connects these taxa.  The lengths of the branches in this tree are a measure of how similar one taxon is to another.  Then, for each pair of sample units:

  • The taxa present in one or both sample units are identified.
  • Each branch is coded as 'shared' if the taxon is present in both sample units and as 'unshared' if the taxon is only present in one sample unit.
  • The distance between two sample units is the sum of unshared branch lengths as a proportion of the total branch length of the tree.

UniFrac is a metric measure with values ranging from 0 to 1.
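The steps above can be sketched in base R for plots 7-9. Everything about the tree here is an assumption made for illustration: the topology simply groups the two Poa species, the branch lengths are invented, and the function name unifrac_unweighted is hypothetical. The sketch also assumes that a branch counts as shared when it leads to taxa in both sample units, and that the denominator is the total length of the branches leading to taxa found in at least one of the two sample units. In practice the tree would come from phylogenetic data and a dedicated package would do the calculation.

```r
# Hypothetical tree: ((Poa pratensis, Poa compressa), Hypochaeris radicata)
# Each element is one branch; 'tips' lists the taxa descending from it.
# Branch lengths are invented for illustration only.
branches <- list(
  list(length = 1, tips = "Poa pratensis"),
  list(length = 1, tips = "Poa compressa"),
  list(length = 3, tips = c("Poa pratensis", "Poa compressa")),  # leads to genus Poa
  list(length = 4, tips = "Hypochaeris radicata"))

unifrac_unweighted <- function(taxa_1, taxa_2, branches) {
  shared <- 0
  unshared <- 0
  for (br in branches) {
    in_1 <- any(br$tips %in% taxa_1)   # branch leads to a taxon in sample unit 1
    in_2 <- any(br$tips %in% taxa_2)   # branch leads to a taxon in sample unit 2
    if (in_1 && in_2) {
      shared <- shared + br$length
    } else if (in_1 || in_2) {
      unshared <- unshared + br$length
    }
  }
  unshared / (shared + unshared)
}

plot7 <- "Poa pratensis"
plot8 <- "Poa compressa"
plot9 <- "Hypochaeris radicata"

u_78 <- unifrac_unweighted(plot7, plot8, branches)
u_79 <- unifrac_unweighted(plot7, plot9, branches)
c(plots_7_8 = u_78, plots_7_9 = u_79)
```

With these made-up branch lengths, plots 7 and 8 are closer to one another than either is to plot 9, whereas their Jaccard dissimilarities would all equal 1.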

UniFrac distances can be equal to or smaller than (but not larger than) distances that do not incorporate phylogeny.  Ertsgaard et al. (2025) showed a strong positive correlation between UniFrac and Jaccard distances when describing the plant composition of alpine peaks in Washington state.

 

The original UniFrac measure (Lozupone & Knight 2005) is unweighted, meaning that it gives equal weight to each taxon (analogous to Jaccard and Sorensen dissimilarities).  There is also a weighted UniFrac which accounts for the relative abundances of taxa (analogous to Bray-Curtis distance) as well as a generalized UniFrac that combines the weighted and unweighted approaches in a single framework (Chen et al. 2012).  More recently, this approach has been adapted for use in paired and longitudinal designs as are commonly used in microbiome studies (Plantinga et al. 2019).

This distance measure is not as easy to apply as the others as it requires phylogenetic information in addition to the standard composition matrix (sample units x species).  It obviously is only relevant for compositional data.

There are a large number of other phylogenetic distances available.  Cadotte & Davies (2016) provide examples of how to build phylogenetic trees and use them to calculate a wide range of metrics (see their Table 3.3).

Gower’s Distance

Most distance measures assume that the underlying data are continuously distributed, but Gower (1971) proposed a generalized coefficient of dissimilarity that can be applied to a dataset consisting of a variety of data types: continuously distributed, nominal, and/or ordinal variables.  Greenacre & Primicerio (2013) provide a nice description of this approach.

Gower’s distance is available in several of the functions summarized below.
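To give a sense of how mixed data types can be combined, here is a simplified base R sketch of the general idea: each variable contributes a dissimilarity between 0 and 1 (a range-scaled absolute difference for numeric variables, a simple mismatch for categorical ones), and these are averaged. The data, variable names, and the function gower_pair are all invented for illustration, and the sketch ignores ordinal variables, missing values, and variable weights. For real analyses, a packaged implementation such as cluster::daisy(x, metric = "gower") is the safer choice.

```r
# Hypothetical environmental data: one numeric and one categorical variable
env <- data.frame(
  elev = c(100, 400, 700),
  soil = c("clay", "clay", "sand"))

# Simplified Gower dissimilarity between rows i and h of a data frame
gower_pair <- function(df, i, h) {
  per_variable <- vapply(names(df), function(v) {
    x <- df[[v]]
    if (is.numeric(x)) {
      abs(x[i] - x[h]) / diff(range(x))   # scaled by the range of the variable
    } else {
      as.numeric(x[i] != x[h])            # 0 if categories match, 1 otherwise
    }
  }, numeric(1))
  mean(per_variable)                      # equal weight for every variable
}

g_12 <- gower_pair(env, 1, 2)
g_13 <- gower_pair(env, 1, 3)
g_23 <- gower_pair(env, 2, 3)
c(g_12, g_13, g_23)
```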

Mahalanobis Distance

Mahalanobis (1936) proposed a way to compare a sample to sets of other samples and determine how likely it is that the sample belongs to each set.  The formula includes the covariance matrix to account for differences in variability among variables.  It is intended for multivariate normal data.

One way the Mahalanobis distance can be used is to evaluate whether individual sample units are outliers relative to the rest of the data - see the ‘Multivariate Outlier Analysis’ chapter for more information.
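A minimal sketch of this use, with made-up data: two variables are strongly correlated in ten sample units, and an eleventh sample unit breaks that relationship. The mahalanobis() function in the stats package returns squared distances, so the square root is taken here. Note that the eleventh sample unit is not especially far from the centroid in Euclidean terms; it stands out only once the covariance between the variables is taken into account.

```r
# Invented data: y closely tracks x in the first ten sample units
x <- 1:10
y <- c(1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 8.0, 9.1, 9.9)

# Eleventh sample unit has typical x and y values, but an unusual combination
dat <- rbind(cbind(x, y), c(3, 9))

# Squared Mahalanobis distance of each sample unit from the centroid
d2 <- mahalanobis(dat, center = colMeans(dat), cov = cov(dat))
md <- sqrt(d2)
print(round(md, 2))

# Compare with Euclidean distance from the centroid
eu <- sqrt(rowSums(sweep(dat, 2, colMeans(dat))^2))

which.max(md)   # the sample unit with the largest Mahalanobis distance
```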

Should Shared Absences Matter?

When summarizing species data, ecologists generally prefer distance measures that are not affected by species that are absent from the two experimental units being compared (cell d in the contingency table shown in the description of Jaccard similarity above).  Species could be absent for any number of reasons, and it generally doesn’t make sense to determine the similarity between two samples based on species that are present in neither.

However, some research has shown that species absences can be informative when dealing with datasets that contain high beta diversity (species turnover among plots).  For more information about ‘extended dissimilarities’, see De’ath (1999) and Boyce & Ellison (2001), the discussion of the dsvdis() function below, and the help files for the stepacross() function in vegan.  Note that using a function such as stepacross() will change how distances are calculated such that they no longer have a fixed upper bound.  Anderson et al. (2011) provide an extended discussion about beta diversity.

Dealing With Empty Sample Units

We’ve mentioned before that most distance measures assume that at least one species is present in all plots.  This can be problematic if you are dealing with a community in which organisms are heterogeneously distributed – their absence from a given area may be very important information!  For example, consider these data:

Plot SppA SppB SppC
4 0 4 8
5 0 1 1
6 1 0 0
10 0 0 0

These are plots 4-6 from above, together with plot 10 which is empty – there were no species in this sample unit.  There are two common ways to deal with this.

First, you could delete the empty sample units, as we discussed earlier (see ‘Data Adjustment’ chapter).  However, this changes the questions being asked - for example, from "how does composition differ among all plots?" to "how does composition differ among vegetated plots?".

Second, you could use a ‘zero-adjusted’ distance measure (Clarke et al. 2006).  This simply involves adding a ‘pseudo-species’ with the same small cover value to all plots.  Often, the cover value used is the minimum value that a species would be assigned if present on a plot.  This pseudo-species is added to all plots - not just those that are empty - so that it alters all distances in the same manner.  This approach can be applied to data subject to any distance measure.  You would want to exclude pseudo-species when calculating metrics such as species richness.

 

Using one of the functions described below, verify that you cannot calculate the Bray-Curtis distance between plot 10 and any of the other plots.  Then, add a ‘dummy’ species with an abundance of 1 in all plots and re-calculate the distance matrix.

Some code that may be of interest:
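# Plots 4-6 from above, plus the empty plot 10
eg <- data.frame(
  Plot = c("p4", "p5", "p6", "p10"),
  SppA = c(0, 0, 1, 0),
  SppB = c(4, 1, 0, 0),
  SppC = c(8, 1, 0, 0) )
Spp <- c("SppA", "SppB", "SppC")
# Bray-Curtis distances among the three non-empty plots
vegan::vegdist(eg[1:3, Spp])
# All four plots: distances involving the empty plot 10 cannot be calculated
vegan::vegdist(eg[, Spp])
# Add a dummy species with an abundance of 1 in every plot, then recalculate
eg$SppDummy <- 1
vegan::vegdist(eg[, c(Spp, "SppDummy")])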

Partitioning Compositional Variation

Compositional variation between sample units can reflect two primary patterns: substitution, and loss (or subsetting).  We can consider this both for incidence-based metrics (Jaccard, Sorensen) and for abundance-based metrics (Bray-Curtis).

The total dissimilarity between sample units and its components are closely related mathematically: the total is the sum of the two components.  Thus, although it is possible to analyze each component, I often focus on just the total and the proportion of the total dissimilarity attributable to one of the two components (e.g., Bakker et al. 2023).

Incidence-based Metrics (Jaccard, Sorensen)

It is easiest to think about the incidence-based metrics - Jaccard and Sorensen.  For any pair of sites, the differences in presence or absence of species might reflect:

  • Nestedness (losses: one site contains a subset of the species present at the other site)
  • Spatial turnover (substitution: different species present at each site)

Either or both of these components might be evident.  These components are additive: the two components sum to the total dissimilarity between the sites as expressed by one of these metrics (Baselga 2010, 2012).

 

The following image shows four scenarios in which 12 species are present or absent in sites:

Examples of how variation in incidence-based composition among sites might reflect differences in (A) nestedness, (B) spatial turnover, (C) both nestedness and spatial turnover, and (D) spatial turnover and differences in richness.  Figure 1 from Baselga (2010).

In scenario (A), site A1 contains 12 species and sites A2 and A3 contain subsets of these species.  Thus, the compositional differences among sites reflect patterns of nestedness.

In scenario (B), total species richness is the same at all sites.  Some species occur at all sites but other species only occur at individual sites.  Thus, the compositional differences between these sites are due to spatial turnover.

In scenario (C), some of the compositional differences are due to nestedness (sites C2 and C3 contain subsets of the species from site C1) but there is also spatial turnover between sites C2 and C3.

In scenario (D), total species richness differs between sites and there is also spatial turnover among sites.
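The additive partition can be illustrated in base R using plots 1-3 from earlier in the chapter. The formulas below are my reading of the Sorensen-based partition in Baselga (2010): the turnover component is min(b, c) / (a + min(b, c)), and the nestedness component is the total minus the turnover. The function name partition_sorensen is invented for this sketch; beta.pair() in betapart is the packaged implementation.

```r
# Partition Sorensen dissimilarity into turnover and nestedness components
partition_sorensen <- function(x, y) {
  x <- x > 0
  y <- y > 0
  n_a <- sum(x & y)          # a: present in both
  n_b <- sum(x & !y)         # b: present in the first sample unit only
  n_c <- sum(!x & y)         # c: present in the second sample unit only
  total    <- (n_b + n_c) / (2 * n_a + n_b + n_c)
  turnover <- min(n_b, n_c) / (n_a + min(n_b, n_c))
  c(total = total, turnover = turnover, nestedness = total - turnover)
}

plots <- matrix(
  c(1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    1, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("SppA", "SppB", "SppC", "SppD", "SppE")))

# Plots 1 and 2 share no species: all of the dissimilarity is turnover
part_12 <- partition_sorensen(plots["1", ], plots["2", ])

# Plot 1 is a subset of plot 3: all of the dissimilarity is nestedness
part_13 <- partition_sorensen(plots["1", ], plots["3", ])

print(rbind(part_12, part_13))
```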

Abundance-based Metrics (Bray-Curtis)

In a 2013 paper, Baselga extended his work to an abundance-based metric, the Bray-Curtis dissimilarity.  If the same species are present at both sites, differences in their abundance might reflect:

  • Abundance gradients (losses: all species are less abundant at one site than the other)
  • Balanced variation (substitution: increases in some species are offset by declines in other species)

Either or both of these components might be evident.  These components are additive: the two components sum to the total Bray-Curtis dissimilarity between sites (Baselga 2013).

These abundance-based components are directly analogous to the components of incidence-based metrics (abundance gradients ~ nestedness; balanced variation ~ spatial turnover) but have been given different names to distinguish them.  However, it should also be noted that if sites differ in both abundance and in species presence/absence, then the value for each of these components will be a combination of the abundance- and incidence-based metrics.  In other words, the value calculated for abundance gradients will reflect abundance gradients and nestedness, while the value calculated for balanced variation will reflect balanced variation and spatial turnover.

 

The following image shows six scenarios in which Bray-Curtis dissimilarity between the two sites is identical but the underlying processes differ:

Examples of how variation in abundance-based composition among sites might reflect different processes. In all six scenarios, the Bray-Curtis dissimilarity between the two sites is 0.33.  In (a-c), the same species are present at all sites so there are no incidence-based differences; the compositional variation among sites can be driven by (a) balanced variation, (b) abundance gradients, or (c) both balanced variation and abundance gradients. In (d-f), the compositional variation reflects a combination of incidence-based processes - nestedness and spatial turnover - and abundance-based processes.  Figure 1 from Baselga (2013).

In scenario (a), the reduced abundance of some species is offset by increased abundance of other species.  Note that the total abundance for each site remains at 60; all differences between the sites are due to balanced variation.

In scenario (b), every species is more abundant in site B1 than site B2, and total abundance is 80 for site B1 but 40 for site B2.  The differences between these sites are due to the abundance gradient.

In scenario (c), some of the compositional variation is due to balanced variation (species 4 increases from site C1 to site C2 while the other species show the opposite pattern) and some of the variation is due to the abundance gradient.

In scenario (d), there is no abundance gradient, so the difference is due to balanced variation.  Since species 1 and 4 are not just reduced but are missing from one site, this balanced variation specifically reflects spatial turnover.

In scenario (e), there is an abundance gradient but no balanced variation.  Inspection of the patterns demonstrates that this abundance gradient reflects species nestedness.

In scenario (f), there is both balanced variation and an abundance gradient, and the abundance gradient also includes species nestedness.
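A base R sketch of the abundance-based partition follows. The formulas are my reading of Baselga (2013), and the function name partition_bray is invented here; bray.part() in betapart is the packaged implementation. The abundance vectors are hypothetical: they were chosen to match the site totals described for scenarios (a) and (b), but they are not the per-species values from the figure.

```r
# Partition Bray-Curtis dissimilarity into balanced variation and
# abundance gradient components
partition_bray <- function(x, y) {
  shared <- sum(pmin(x, y))    # abundance common to both sites
  only_x <- sum(x) - shared    # abundance unique to the first site
  only_y <- sum(y) - shared    # abundance unique to the second site
  total    <- (only_x + only_y) / (2 * shared + only_x + only_y)
  balanced <- min(only_x, only_y) / (shared + min(only_x, only_y))
  c(total = total, balanced = balanced, gradient = total - balanced)
}

# Hypothetical pair with equal totals (60 and 60): balanced variation only
part_bal <- partition_bray(c(20, 20, 10, 10), c(10, 10, 20, 20))

# Hypothetical pair with totals of 80 and 40: abundance gradient only
part_gra <- partition_bray(c(20, 20, 20, 20), c(10, 10, 10, 10))

print(round(rbind(part_bal, part_gra), 3))
```

Both pairs have the same total Bray-Curtis dissimilarity (one third), but it is attributed entirely to balanced variation in the first pair and entirely to the abundance gradient in the second.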

R Functions to Calculate Distance Measures

The above distance measures, and others, can be calculated using many R functions, including:

  • dist() in stats
  • vegdist() and betadiver() in vegan (this package also includes the designdist() function for constructing your own distance measure)
  • dsvdis() in labdsv
  • distance() and bcdist() in ecodist
  • gdist() and xdiss() in mvpart
  • daisy() in cluster
  • gowdis() in FD
  • beta.pair() and bray.part() in betapart

We’ll survey a few of these functions here.  Don’t forget to load packages before calling their functions!  Details about the measures calculated by each function can be obtained through the R help files.

Keep in mind that many distance measures are referred to by several equivalent names.

stats::dist()

The dist() function is part of the stats package which is part of the base R installation and thus is automatically loaded.  It can calculate 6 different distance measures.  Its usage is:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

The key arguments are:

  • x – the data matrix, data frame, or distance matrix to be analyzed
  • method – the distance measure to be used.  There are 6 distance measures available:
    • euclidean – the default
    • maximum – the largest absolute difference among all pairwise elements.
    • manhattan – aka City-block. Calculated as the sum of the absolute differences along all dimensions.
    • canberra – Sum across all pairwise elements of the absolute value of the difference divided by the absolute value of the sum.
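    • binary – proportion of elements in which only one of the two values is non-zero, among the elements in which at least one is non-zero. Pairs where both are zero are ignored, so this is equivalent to the Jaccard dissimilarity of presence/absence data.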
    • minkowski – generalized form of Euclidean and Manhattan distances (see p. 47 of McCune & Grace (2002) for equation). Requires an additional argument specifying the square root and power to be applied; the default is p = 2.
  • diag – print the diagonal of the distance matrix?  Default is FALSE (no).
  • upper – print the upper-triangle of the matrix?  Default is FALSE (no).
  • p – power; used only when calculating the Minkowski distance.
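A brief sketch of these arguments in use, with plots 4-6 from earlier in the chapter (the object names are arbitrary):

```r
# Abundance of three species on plots 4-6
comm <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC")))

d_euc  <- dist(comm)                               # default: Euclidean
d_man  <- dist(comm, method = "manhattan")
d_max  <- dist(comm, method = "maximum")           # largest single difference
d_bin  <- dist(comm, method = "binary")            # non-zero values treated as presences
d_mink <- dist(comm, method = "minkowski", p = 1)  # p = 1 reproduces Manhattan

all.equal(as.vector(d_mink), as.vector(d_man))

# diag and upper only change how the result is printed
print(dist(comm, diag = TRUE, upper = TRUE))
```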

vegan::vegdist()

The vegdist() function in vegan currently includes 22 distance measures, all of which are listed below.  Its usage is:
vegdist(x, method = "bray", binary = FALSE, diag = FALSE, upper = FALSE, na.rm = FALSE, ...)

The key arguments are:

  • x – the data matrix to be analyzed, with plots as rows and variables as columns
  • method – type of distance measure to use.  Notes about a few of the distance measures follow; the others can be found in the R help files.
    • manhattan – identical to result from dist()
    • euclidean – identical to result from dist(), but not the default measure.
    • canberra – may yield different results than from dist()
    • clark
    • bray – Bray-Curtis. The default measure for this function.
    • kulczynski
    • jaccard – computed as 2B/(1+B), where B is the Bray-Curtis dissimilarity. The help file indicates that Bray-Curtis and Jaccard dissimilarities are rank-order similar and suggests that the Jaccard “probably should be preferred” because it is metric whereas Bray-Curtis is semimetric. However, I have not seen any formal evaluation of how sensitive conclusions are to this choice.
    • gower – the help file indicates that this version of Gower’s distance cannot handle mixed data (e.g., continuous variables and factors simultaneously). I have not tested this.  The cluster::daisy() function is recommended for these types of data.
    • altGower
    • morisita
    • horn
    • mountford
    • raup
    • binomial
    • chao – a measure that accounts for unseen species pairs (i.e., differences in sample size that particularly affect rare species). See Chao et al. (2005) for details.
    • cao
    • mahalanobis
    • chisq – Chi-square distances, as used in correspondence analysis.
    • chord
    • hellinger
    • aitchison
    • robust.aitchison
  • binary – whether to use decostand() to convert data to presence/absence before calculating distances. Default is FALSE (no).
  • diag – whether to return the diagonal of the distance matrix. Default is FALSE (no).
  • upper – whether to return the values in the upper triangle of the distance matrix. Default is FALSE (no).
  • na.rm – whether to delete missing observations (pairwise) when calculating dissimilarities. Default is FALSE (no).
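The sketch below applies a few of these options to plots 4-6, assuming the vegan package is installed. The requireNamespace() guard is my addition so that the code is skipped rather than failing when vegan is unavailable. It also checks the relationship between the quantitative Jaccard and Bray-Curtis measures noted above.

```r
# Abundance of three species on plots 4-6
comm <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC")))

if (requireNamespace("vegan", quietly = TRUE)) {
  bc     <- vegan::vegdist(comm, method = "bray")       # default measure
  jac    <- vegan::vegdist(comm, method = "jaccard")    # abundance-based
  jac_pa <- vegan::vegdist(comm, method = "jaccard",
                           binary = TRUE)               # presence/absence
  print(bc)
  print(jac_pa)

  # Quantitative Jaccard is 2B / (1 + B), where B is Bray-Curtis
  print(all.equal(as.vector(jac), as.vector(2 * bc / (1 + bc))))
}
```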

labdsv::dsvdis()

The dsvdis() function in labdsv currently offers 7 distance measures.  Its usage is:

dsvdis(x, index, weight = rep(1, ncol(x)), step = 0.0, diag = FALSE, upper = FALSE)

The key arguments are:

  • x – the data matrix to be analyzed, with plots as rows and variables as columns
  • index – the distance measure to be used.  Note that this argument has a different name than the comparable argument in the other functions (‘index’ vs ‘method’) and does not have a default distance measure.  The available distance measures are:
    • steinhaus – aka Jaccard. Based on presence/absence (converts abundances automatically).
    • sorensen – based on presence/absence (converts abundances automatically).
    • ochiai – based on presence/absence (converts abundances automatically).
    • ruzicka
    • bray/curtis – Bray-Curtis. Note the slash in the name here.
    • roberts
    • chisq
  • weight – an opportunity to weight species differently during the calculation of distances.  The default (rep(1,ncol(x))) assigns all species the same value (1).
  • step – a threshold dissimilarity to initiate shortest-path adjustment.  This is likely to be most useful in datasets where there is complete species turnover (e.g., spanning very large gradients).  The default (0.0) means that no adjustments are made.  If a positive value is specified, any distances above that value are replaced by the distance of the shortest connected path between points less than the threshold apart.  The result is that the dissimilarity values are no longer bounded between 0 and 1.  To my knowledge, the consequences of these types of adjustments for ecological interpretation have not been rigorously assessed.
  • diag – whether to return data for the diagonal.  Default is FALSE (no).
  • upper – whether to return data in the upper triangle of the distance matrix (TRUE) or the lower triangle (FALSE).  Default is FALSE (no).
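
The following sketch illustrates the naming difference noted above ('index' and "bray/curtis" in labdsv vs. 'method' and "bray" in vegan).  The matrix x is hypothetical.  Under the default equal weights, the two functions are expected to return the same Bray-Curtis values, which the last line checks.

```r
library(labdsv)
library(vegan)

# Hypothetical abundance matrix: 3 plots (rows) x 4 species (columns)
x <- matrix(c(5, 0, 2, 1,
              3, 4, 0, 1,
              0, 6, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))

# labdsv: the argument is 'index' and the name includes a slash
bc.labdsv <- dsvdis(x, index = "bray/curtis")

# vegan: the argument is 'method' and the name is "bray"
bc.vegan <- vegdist(x, method = "bray")

# Compare the numeric values, ignoring attributes of the 'dist' objects
all.equal(as.vector(bc.labdsv), as.vector(bc.vegan))
```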

ecodist::distance()

For an introduction to the ecodist package, see Goslee & Urban (2007).  In ecodist, the distance() function can calculate 10 different distance measures.  Its usage is:

distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov, inverted = FALSE)

The key arguments are:

  • x – the data matrix or data frame to be analyzed, with plots as rows and variables as columns
  • method – the distance measure to be used.  There are 10 distance measures available:
    • euclidean – the default method for this function.
    • bray-curtis – note the dash in the name here. Can also be calculated directly using the bcdist() function in this package.
    • manhattan
    • mahalanobis
    • jaccard
    • difference
    • sorensen
    • gower – Gower’s distance.
    • modgower10 – variation of Gower’s distance measure, using base 10.
    • modgower2 – variation of Gower’s distance measure, using base 2.
  • sprange and spweight permit species data to be relativized during the distance calculation.  These only apply to a subset of the available distance measures, particularly Gower's distance and variations on it.
  • icov and inverted are used when calculating Mahalanobis distances.

 

This package also includes bcdist(), a function that returns the Bray-Curtis distance measure.  This is faster than distance(x, method = "bray-curtis"), and includes an option to drop empty rows or to set distances between empty rows to zero.
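
A brief sketch comparing the two ecodist routes to Bray-Curtis distances, using a hypothetical matrix x.  The functions are called with the package prefix because other packages may also define a function named distance().

```r
library(ecodist)

# Hypothetical abundance matrix: 3 plots (rows) x 4 species (columns)
x <- matrix(c(5, 0, 2, 1,
              3, 4, 0, 1,
              0, 6, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))

# General-purpose function; note the dash in the method name
bc1 <- ecodist::distance(x, method = "bray-curtis")

# Dedicated Bray-Curtis function
bc2 <- ecodist::bcdist(x)

# The two sets of values should agree
all.equal(as.vector(bc1), as.vector(bc2))
```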

betapart

As noted above, compositional dissimilarities can be partitioned into components related to loss (subsetting) and substitution.  The betapart package provides functions to calculate three common compositional dissimilarities and to partition each dissimilarity into its components.  The components differ depending on whether the data are incidence-based (i.e., presence/absence; Jaccard or Sorensen) or abundance-based (Bray-Curtis).

The beta.pair() function can be used to calculate either Jaccard or Sorensen dissimilarities.  It returns three distance matrices, one for turnover, one for nestedness, and one for total dissimilarity.

The bray.part() function calculates three distance matrices, one for balanced variation, one for abundance gradients, and one for total Bray-Curtis dissimilarity.
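
The sketch below applies both functions to a small hypothetical matrix (x is made up for illustration).  It assumes the list element names used by betapart (beta.sim, beta.sne, and beta.sor for the Sorensen family; bray.bal, bray.gra, and bray for Bray-Curtis) and checks that the two components sum to the total in each case.

```r
library(betapart)

# Hypothetical abundance matrix: 3 plots (rows) x 4 species (columns)
x <- matrix(c(5, 0, 2, 1,
              3, 4, 0, 1,
              0, 6, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))

# beta.pair() requires presence/absence (0/1) data
x.pa <- (x > 0) * 1
sor <- beta.pair(x.pa, index.family = "sorensen")

# Turnover + nestedness should equal total Sorensen dissimilarity
all.equal(as.vector(sor$beta.sim + sor$beta.sne), as.vector(sor$beta.sor))

# bray.part() works on the abundance data directly
bc <- bray.part(x)

# Balanced variation + abundance gradients should equal total Bray-Curtis
all.equal(as.vector(bc$bray.bal + bc$bray.gra), as.vector(bc$bray))
```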

Examples

Today’s Example Dataset

The dataset of plots 1-3 used to introduce the Euclidean distance measure is available as a text file (Legendre.Legendre.2012.p311.txt) in the book's GitHub folder.  Save it in the 'data' subfolder within your course folder.  Then, open your R project and load the data into R:
test <- read.table("data/Legendre.Legendre.2012.p311.txt", header = TRUE)

To see the Euclidean distances among plots in test:
(ED.test <- dist(test))

Note that the result is displayed as a lower triangular matrix.  What class is this object?

 

We could have seen the result as a full matrix by printing the diagonal and upper triangle during its calculation:
dist(test, diag = TRUE, upper = TRUE)
However, we can also easily convert the result to a full matrix just by changing its class:
as.matrix(ED.test)
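
The same steps can be tried on a tiny made-up dataset where the distances are easy to verify by hand.  The object toy and its values are hypothetical; only base R functions are used.

```r
# Hypothetical data: 3 plots x 2 variables
toy <- data.frame(v1 = c(0, 3, 0),
                  v2 = c(0, 4, 1),
                  row.names = c("p1", "p2", "p3"))

d <- dist(toy)        # Euclidean distance is the default for dist()
inherits(d, "dist")   # stored as a 'dist' object (lower triangle only)
length(d)             # one value per pair of plots

m <- as.matrix(d)     # full square matrix, including the diagonal
dim(m)
m["p1", "p2"]         # sqrt(3^2 + 4^2) = 5
```

The full matrix is convenient for looking up the distance between a specific pair of plots by name, whereas the 'dist' object is the form that most analytical functions expect.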

Oak Plant Community Dataset

Now, let’s calculate some distance matrices using our Oak plant community dataset.  Assuming you have already saved the files to your data folder, we begin by loading them:
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE, row.names = 1)
Oak_species <- read.csv("data/Oak_species_189x5.csv", header = TRUE)

Euclidean Distances Among Stands (Physical Distance)

As we’ve already seen, this dataset includes a number of potential explanatory variables along with the abundances of a large number of species (see the ‘Oak_Metadata.docx’ file for details). We will focus on the latitude and longitude of each stand.  To do so, we need to extract these two variables and then use them to calculate a distance matrix. The Euclidean distance measure is a reasonable choice since these data are spatial coordinates.

library(vegan)
geog.dis <- Oak |>
  select(LatAppx, LongAppx) |>
  vegdist(method = "euclidean")

Bray-Curtis Distances Among Stands (Species Composition)

Now, let's calculate the compositional differences among the stands.  We will create an object that contains the compositional data:
Oak_abund <- Oak[ , colnames(Oak) %in% Oak_species$SpeciesCode]

 

As we proceed, let’s think about relativizations again. In this dataset, trees were measured in very different units than other growth forms so it makes sense to relativize each species by its maximum – this will put the data for all species on the same scale. (If appropriate for our objectives, we could also have made other adjustments, such as deleting rare species and relativizing by row totals.  For simplicity, we did not do so here.)  We can do this within our piped functions:

spp.dis <- Oak_abund |>
  decostand(method = "max") |>
  vegdist()

I didn’t specify a method for vegdist().  Why not?

Exploring and Comparing Distance Matrices

Use the str() function to view the details of geog.dis and spp.dis.  Note that they are stored as objects of class 'dist'.  Even though they appear as matrices, we can't use dim() to see their dimensionality; use length() instead.  

Verify that these distance matrices are the same size (1081 elements); this is because they are based on the same sample units.  Why this many distances?  Recall from the last chapter the formula for the number of pairwise distances among n sample units.

 

Finally, note that the number of variables has no effect on the size of the resulting distance matrix.  For example, geog.dis was based on two variables and spp.dis was based on 189 variables.
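
These two points can be checked with a short sketch.  The data here are simulated (random values in hypothetical objects few and many), not the Oak data; only the dimensions match those of the objects above.

```r
n <- 47                 # number of sample units (stands)

# Number of pairwise distances among n sample units
n * (n - 1) / 2         # 1081
choose(n, 2)            # same result

# Simulated data with the same number of rows but different numbers of columns
set.seed(42)
few  <- matrix(runif(n * 2),   nrow = n)   # 2 variables
many <- matrix(runif(n * 189), nrow = n)   # 189 variables

# The number of distances depends only on the number of rows
length(dist(few))
length(dist(many))
```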

 

Distance matrices like these are what we are going to utilize throughout the rest of the course.

Conclusions

The behavior of a distance measure is highly influenced by the data adjustments (deletions, transformations, standardizations/relativizations) applied prior to its calculation.  It bears repeating that the appropriateness of these various adjustments must be determined based on the questions you are addressing.

A single distance matrix always contains values calculated using the same distance measure; we would never combine values obtained from different distance measures in one distance matrix.  However, we will see examples of how matrices derived from different data on the same sample units (as with geog.dis and spp.dis above) can be compared.

There are many distance measures to choose from.  While this can be confusing, it also gives flexibility.  Distance measures have different properties and/or make different assumptions about the data.  These properties and assumptions are important to keep in mind when choosing which distance measure to use.  Some analytical techniques implicitly assume a distance measure:

  • Correspondence Analysis (CA) and Canonical Correspondence Analysis (CCA) are based on chi-square distances
  • Principal Component Analysis (PCA) is based on Euclidean distances

In instances like these, any limitations of the assumed distance measure are carried over to the associated technique.  For example, if it is inappropriate to summarize a dataset using Euclidean distances, it would also be inappropriate to analyze that dataset using PCA.  Conversely, if Euclidean distances are appropriate to apply to a dataset, then PCA may be appropriate … depending on other considerations such as the degree of correlation.  See the 'PCA' chapter for additional information.
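
One way to see the link between PCA and Euclidean distance is to check that the distances among sample units are unchanged when calculated from the full set of principal component scores.  This is a minimal sketch using simulated data (the object x is hypothetical) and the base R function prcomp() without scaling.

```r
set.seed(1)
x <- matrix(rnorm(10 * 4), nrow = 10)   # hypothetical: 10 sample units x 4 variables

pca <- prcomp(x)          # centers the variables; no scaling by default

d.raw <- dist(x)          # Euclidean distances from the original variables
d.pca <- dist(pca$x)      # Euclidean distances from all PC axes

# PCA rotates the centered data, so the distances are preserved
all.equal(as.vector(d.raw), as.vector(d.pca))
```

Using only the first few axes gives an approximation of these Euclidean distances, which is why limitations of the Euclidean measure carry over to PCA.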

The peer-reviewed literature is one of the best ways to identify which distance measures to consider using.  Find papers that are using comparable datasets and asking comparable questions.  Which distance measure(s) do they use?

 

The fact that distance measures have different properties can be an asset.  For example, a compositional matrix could be summarized with a presence/absence-based measure and with an abundance-based measure.  Rare species are given more weight in a presence/absence measure whereas common species are given more weight in an abundance-based measure.  Therefore, if these two distance matrices were analyzed identically, differences between the resulting analyses would reflect whether patterns in the community are being driven by the rare or the common species (Lozupone et al. 2007).
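
A small made-up example illustrates this contrast.  In the hypothetical matrix below, plots 1 and 2 share one dominant species but each has a different rare species.  The abundance-based measure treats them as nearly identical, whereas the presence/absence version of the same measure does not.

```r
library(vegan)

# Hypothetical abundance matrix: 3 plots (rows) x 4 species (columns)
x <- matrix(c(50, 1, 0, 0,
              50, 0, 1, 0,
               0, 0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("plot", 1:3), paste0("sp", 1:4)))

# Abundance-based: Bray-Curtis
d.abund <- vegdist(x, method = "bray")

# Incidence-based: Bray-Curtis on presence/absence data (i.e., Sorensen)
d.pa <- vegdist(x, method = "bray", binary = TRUE)

as.matrix(d.abund)["plot1", "plot2"]   # 2/102, driven by the shared dominant
as.matrix(d.pa)["plot1", "plot2"]      # 0.5, driven by the unshared rare species
```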

In a recent study (Bakker et al. 2023), we compared abundance- and incidence-based measures of compositional variation at 60 grassland sites.  We focused on total dissimilarity (Bray-Curtis, Sorensen), the percentage of that variation due to balanced variation and to species turnover, alpha diversity (species per plot), and gamma diversity (species per site).  We found that grasslands differed greatly in the importance of these metrics, that they were associated with different aspects of the environmental conditions at a site, and that they were moderately correlated with one another (see figure below).  We concluded that "our understanding of compositional variation at a site is enhanced by considering multiple metrics simultaneously" (p. 2).

Scatterplot matrix of metrics of compositional variation. Bray-Curtis dissimilarity and Sorensen dissimilarity are abundance-based and incidence-based measures of overall variation. Balanced variation and species turnover are abundance-based and incidence-based measures of how much of the overall variation was due to substitution-related patterns rather than patterns of loss or subsetting. Alpha diversity is the mean number of species per plot. Gamma diversity is the total number of species per site.  Figure 3 from Bakker et al. (2023).

 

References

Anderson, M.J., T.O. Crist, J.M. Chase, M. Vellend, B.D. Inouye, A.L. Freestone, N.J. Sanders, H.V. Cornell, L.S. Comita, K.F. Davies, S.P. Harrison, N.J.B. Kraft, J.C. Stegen, and N.G. Swenson. 2011. Navigating the multiple meanings of β diversity: a roadmap for the practicing ecologist. Ecology Letters 14:19-28.

Bakker, J.D., J.N. Price, J.A. Henning, E.E. Batzer, T.J. Ohlert, C.E. Wainwright, P.B. Adler, J. Alberti, C.A. Arnillas, L.A. Biederman, E.T. Borer, L.A. Brudvig, Y.M. Buckley, M.N. Bugalho, M.W. Cadotte, M.C. Caldeira, J.A. Catford, Q. Chen, M.J. Crawley, P. Daleo, C.R. Dickman, I. Donohue, M.E. DuPre, A. Ebeling, N. Eisenhauer, P.A. Fay, D.S. Gruner, S. Haider, Y. Hautier, A. Jentsch, K. Kirkman, J.M.H. Knops, L.S. Lannes, A.S. MacDougall, R.L. McCulley, R.M. Mitchell, J.L. Moore, J.W. Morgan, B. Mortensen, H. Olde Venterink, P.L. Peri, S.A. Power, S.M. Prober, C. Roscher, M. Sankaran, E.W. Seabloom, M.D. Smith, C. Stevens, L.L. Sullivan, M. Tedder, G.F. Veen, R. Virtanen, and G.M. Wardle. 2023. Compositional variation in grassland plant communities. Ecosphere 14(6):e4542.

Baselga, A. 2010. Partitioning the turnover and nestedness components of beta diversity. Global Ecology and Biogeography 19:134-143.

Baselga, A. 2012. The relationship between species replacement, dissimilarity derived from nestedness, and nestedness. Global Ecology and Biogeography 21:1223-1232.

Baselga, A. 2013. Separating the two components of abundance-based dissimilarity: balanced changes in abundance vs. abundance gradients. Methods in Ecology and Evolution 4:552-557.

Borcard, D., F. Gillet, and P. Legendre. 2018. Numerical Ecology with R. 2nd edition. Springer, New York, NY.

Boyce, R.L., and P.C. Ellison. 2001. Choosing the best similarity index when performing fuzzy set ordination on binary data. Journal of Vegetation Science 12:711-720.

Bray, J.R., and J.T. Curtis. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs 27:325-349.

Cadotte, M.W., and T.J. Davies. 2016. Phylogenies in Ecology: A Guide to Concepts and Methods. Princeton University Press, Princeton, NJ.

Chao, A., R.L. Chazdon, R.K. Colwell, and T-J. Shen. 2005. A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters 8:148-159.

Chen, J., K. Bittinger, E.S. Charlson, C. Hoffmann, J. Lewis, G.D. Wu, R.G. Collman, F.D. Bushman, and H. Li. 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28(16):2106-2113.

Clarke, K.R., P.J. Somerfield, and M.G. Chapman. 2006. On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages. Journal of Experimental Marine Biology and Ecology 330:55-80.

De’ath, G. 1999. Extended dissimilarity: a method of robust estimation of ecological distances from high beta diversity data. Plant Ecology 144:191-199.

Ertsgaard, E.W., N.L. Gjording, J.D. Bakker, J.A. Kleinkopf, and D.E. Giblin. 2025. Geology and climate drive alpine plant compositional variation among peaks in the Cascade Range of Washington. PLoS ONE 20(1):e0317140.

Faith, D.P., P.R. Minchin, and L. Belbin. 1987. Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69:57-68.

Goslee, S.C., and D.L. Urban. 2007. The ecodist package for dissimilarity-based analysis of ecological data. Journal of Statistical Software 22(7):1-19.

Gower, J.C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27:857-874.

Greenacre, M., and R. Primicerio. 2013. Measures of distance between samples: Non-Euclidean. Ch. 5 in Multivariate Analysis of Ecological Data. Fundación BBVA, Bilbao, Spain.

Jaccard, P. 1912. The distribution of the flora in the alpine zone. The New Phytologist 11:37-50.

Legendre, P., and M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16:951-963.

Legendre, P., and L. Legendre. 2012. Numerical Ecology. 3rd English Edition. Elsevier, Amsterdam, The Netherlands.

Lozupone, C., and R. Knight. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71(12):8228-8235.

Lozupone, C., M. Hamady, S.T. Kelley, and R. Knight. 2007. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73(5):1576-1585.

Mahalanobis, P.C. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1):49-55.

McCune, B., and J.B. Grace. 2002. Analysis of Ecological Communities. MjM Software Design, Gleneden Beach, OR.

Plantinga, A.M., J. Chen, R.R. Jenq, and M.C. Wu. 2019. pldist: ecological dissimilarities for paired and longitudinal microbiome association analysis. Bioinformatics 35(19):3567-3575.

Sørensen, T.J. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Kongelige Danske videnskabernes selskab 5(4):1-34.  [need to verify citation]

Media Attributions

  • Jaccard.contingency.table
  • Baselga.2010_Figure1 © Baselga 2010
  • Baselga.2013_Figure1 © Baselga 2013
  • Bakker.et.al.2023_Figure3 © Bakker et al. 2023

License


Applied Multivariate Statistics in R Copyright © 2026 by Jonathan Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.