Common Distance Measures

Jonathan D. Bakker

Learning Objectives

To consider a range of distance measures used with ecological data, and the types of data for which they are appropriate.

To consider whether shared absences matter and how to deal with empty sample units.

To continue using R.

Resources

Legendre & De Cáceres (2013)

Key Packages

invisible(lapply(c("tidyverse", "vegan", "labdsv", "ecodist", "betapart"), library, character.only = TRUE))

Introduction

Distance measures are an essential component of many ecological analyses. There are many to choose from; Legendre & De Cáceres (2013) compared 16 of them, and Legendre & Legendre (2012, Table 7.2) list 26 of them! Borcard et al. (2018) devote an entire chapter to distance measures, using two primary criteria to organize their discussion:

What are the distributional characteristics of the data? For example, are the data binary (e.g., presence/absence) or continuously distributed (e.g., abundance)?
Are shared absences meaningful? A shared absence is a species (or other variable) that is absent from both sample units under consideration. Data are ‘symmetric’ if a shared absence is meaningful, and ‘asymmetric’ if it is not. Symmetry in this case means that two samples having the same zero value is as meaningful as those samples having another value the same (e.g., if the value of a variable was 1.5 for both samples). Species composition data are a prime example of a situation where shared absences are not meaningful, as discussed in the ‘Two Issues to Consider’ section below.

Combinations of these criteria are most appropriately handled using different types of distance measures. A few distance measures are identified here; those discussed below are in bold.

	Species data (asymmetric)	Non-species data (symmetric)
Quantitative (continuous)	Bray-Curtis, Chi-Square, UniFrac	Euclidean, Manhattan
Mixed, including categorical		Gower
Binary	Jaccard, Sorensen, UniFrac	Simple matching

Note: the simple matching distance measure is not commonly used in ecology, and is not discussed here.

Euclidean Distance

The Pythagorean theorem is easily visualized in two dimensions, as we did in the last chapter. It can also be applied to more dimensions, though it rapidly becomes difficult to visualize this.

The formula for the Euclidean Distance (ED) between samples i and h across p dimensions is:

[latex]ED = \sqrt{\sum_{j=1}^p(a_{hj} - a_{ij})^2}[/latex]

Here is a dataset reporting the presence or absence of each of five species (variables) on three plots:

Plot	SppA	SppB	SppC	SppD	SppE
1	1	1	1	0	0
2	0	0	0	1	1
3	1	1	1	1	1

(Source: Legendre & Legendre 2012, p. 311)

What is the Euclidean distance (ED) between each pair of plots?

ED(1,2) = ________

ED(1,3) = ________

ED(2,3) = ________

Verify that the distance from a plot to itself is zero (property 1), and that the distance from plot 1 to plot 2 is the same as the distance from plot 2 to plot 1 (property 3). Euclidean distances can take any non-negative value from 0 to infinity (property 2).
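These calculations can be checked in R. The sketch below enters plots 1-3 by hand (the object names are my own, not part of the course data), computes one distance directly from the formula, and then compares it with the result from dist().

```r
# Plots 1-3 entered by hand; 'plots123' is an illustrative object name
plots123 <- matrix(
  c(1, 1, 1, 0, 0,
    0, 0, 0, 1, 1,
    1, 1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("1", "2", "3"),
                  c("SppA", "SppB", "SppC", "SppD", "SppE"))
)

# Euclidean distance between plots 1 and 2, straight from the formula
ed_12 <- sqrt(sum((plots123["2", ] - plots123["1", ])^2))

# All pairwise Euclidean distances, converted to a full matrix
ed_mat <- as.matrix(dist(plots123, method = "euclidean"))

ed_12                 # should match ed_mat["1", "2"]
diag(ed_mat)          # distance from each plot to itself
isSymmetric(ed_mat)   # is distance 1 -> 2 the same as 2 -> 1?
```

Working from the formula first makes it easier to see what dist() is doing before relying on it for larger datasets.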

Euclidean distances can be calculated using positive and negative values. One potential limitation of this distance measure is that the calculated distance depends on the scale of the variables. For example, if variables are measured using very different scales (e.g., biomass in g for forbs, Mg for trees), the distances will be disproportionately affected by the variables measured using the larger scales (Legendre & Legendre 2012). However, this can be addressed by relativizing the variables appropriately before calculating distances.

Euclidean distances are appropriate for many types of data, including geographic distances. However, Euclidean distances are generally inappropriate for community data (e.g., a plot x species matrix containing the cover or presence/absence of multiple species). Why? One reason is that it’s possible for two samples with no species in common to have a smaller Euclidean distance than two samples that share species. For example, compare the Euclidean distances among the following plots:

Plot	SppA	SppB	SppC
4	0	4	8
5	0	1	1
6	1	0	0

(Source: modified from Legendre & Legendre 2012, Figure 7.8)

ED(4,5) = ________

ED(4,6) = ________

ED(5,6) = ________

Verify that plots 5 and 6 are more similar than plots 4 and 5. Many ecologists find this unsatisfying because plots 5 and 6 have no species in common whereas plots 4 and 5 share the same species and differ only in abundance. Instead, they would argue that the presence of the same species is more important than a difference in abundance of that species.
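One way to check this comparison is sketched below. The data are plots 4-6 entered by hand, and the count of shared species is my own addition to make the contrast explicit.

```r
# Plots 4-6 entered by hand; object names are illustrative
plots456 <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC"))
)

# Pairwise Euclidean distances as a full matrix
ed_456 <- as.matrix(dist(plots456))
ed_456

# Number of species shared by each pair of plots
pa <- (plots456 > 0) * 1     # presence/absence as 0/1
shared <- tcrossprod(pa)     # off-diagonal cells = shared species
shared

# Smaller distance, yet no species in common?
ed_456["5", "6"] < ed_456["4", "5"]
shared["5", "6"]
```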

Manhattan Distance

Euclidean distances are calculated by squaring the difference associated with each variable, but an alternative is to simply add the (absolute) differences. This is analogous to summing the two perpendicular sides of a triangle rather than using the Pythagorean theorem to calculate the hypotenuse.

The formula for the Manhattan distance between samples i and h across p dimensions is:

[latex]MD = \sum_{j=1}^p \mid ( a_{hj} - a_{ij} ) \mid[/latex]

In the last chapter, we used the Euclidean distance to calculate the hypotenuse between two sample units based on the numbers of annual and perennial species. The Manhattan distance between these sample units is the sum of the two perpendicular sides of the triangle.

This is also called the city-block distance … can you see why?
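As a small illustration, the sketch below computes the Manhattan distance between plots 4 and 5 (from the Euclidean section above) by hand and with dist(). The object names are my own.

```r
# Abundances for plots 4 and 5, entered by hand
plot4 <- c(SppA = 0, SppB = 4, SppC = 8)
plot5 <- c(SppA = 0, SppB = 1, SppC = 1)

# Manhattan: sum of absolute differences
md_45 <- sum(abs(plot4 - plot5))

# Euclidean, for comparison: square root of summed squared differences
ed_45 <- sqrt(sum((plot4 - plot5)^2))

# The same Manhattan distance from dist()
md_dist <- dist(rbind(plot4, plot5), method = "manhattan")

c(manhattan = md_45, euclidean = ed_45)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of sample units, just as the two perpendicular sides of a triangle together are never shorter than the hypotenuse.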

Jaccard Similarity and Dissimilarity

Let’s consider presence/absence data some more. For any two plots, species occurrences can be summarized in a contingency table:

		Plot B
		Present	Absent
Plot A	Present	a	b
	Absent	c	d

Note that this is not a data matrix. Rather:

a is the number of species that are present in both plots
b is the number of species that are present in plot A but missing from plot B
c is the number of species that are missing from plot A but present in plot B
d is the number of species that are missing from both plots.

Jaccard (1912) proposed that we quantify the proportion of species that are present in both samples. This is known as Jaccard similarity ([latex]S_{J}[/latex]):

[latex]S_{J} = \frac{a}{a + b + c}[/latex]

As a proportion, this value is bounded between 0 (no shared species) and 1 (all shared species). The corresponding dissimilarity, described below, is a metric measure.

Note that species that are missing from both plots (d) are not included in this calculation; see the ‘Two Issues to Consider’ section below for more information on this.

Since Jaccard similarity has an upper bound of 1, it is converted to Jaccard dissimilarity ([latex]D_{J}[/latex]) by subtraction. [latex]D_{J}[/latex] can also be calculated directly from the contingency table of species occurrences:

[latex]D_{J} = 1 - S_{J} = \frac{b + c}{a + b + c}[/latex]

Jaccard dissimilarity is the proportion of the species recorded in the two samples that occur in only one of them.

Refer back to plots 1-3 for which we calculated Euclidean distances. What is the Jaccard dissimilarity between each pair of plots?

	Plot1	Plot2
Plot2
Plot3
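To connect the contingency table to code, here is a minimal sketch that tallies a, b, c, and d for two plots and then applies the formula for Jaccard dissimilarity. The helper names (abc_counts, jaccard_dis) are hypothetical, and the plots are entered by hand.

```r
# Tally the contingency-table cells for two sample units
abc_counts <- function(x, y) {
  x <- x > 0
  y <- y > 0
  c(a = sum(x & y),    # present in both
    b = sum(x & !y),   # present in first only
    c = sum(!x & y),   # present in second only
    d = sum(!x & !y))  # absent from both (not used by Jaccard)
}

# Jaccard dissimilarity: (b + c) / (a + b + c)
jaccard_dis <- function(x, y) {
  n <- abc_counts(x, y)
  unname((n["b"] + n["c"]) / (n["a"] + n["b"] + n["c"]))
}

plot1 <- c(1, 1, 1, 0, 0)
plot2 <- c(0, 0, 0, 1, 1)
plot3 <- c(1, 1, 1, 1, 1)

abc_counts(plot1, plot3)
dj_13 <- jaccard_dis(plot1, plot3)

# Base R's "binary" method should agree for presence/absence data
dj_13_base <- as.numeric(dist(rbind(plot1, plot3), method = "binary"))
```

The same helper can be applied to the other two pairs of plots to fill in the table.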

Note: Recent work has decomposed or partitioned Jaccard dissimilarities into two components, turnover and nestedness (Baselga 2010, 2012). Turnover is species replacement (one species replaced with another), while nestedness is the extent to which the composition of one sample unit is a subset of the composition of another sample unit. See the description of the betapart package below for more information.

Sorensen Similarity and Dissimilarity

Sorensen similarity ([latex]S_{S}[/latex]) is the proportion of species that are present in both samples, while accounting for differences in species richness between samples. Using the same terminology as for Jaccard similarity, the formula is:

[latex]S_{S} = \frac{a}{\frac{(a + b) + (a + c)}{2}} = \frac{2a}{2a + b + c}[/latex]

Like Jaccard similarity, this value is bounded between 0 (no shared species) and 1 (all shared species). Unlike Jaccard, however, it is semimetric.

Several people proposed this distance measure independently; the original publication by Sørensen is from 1948.

Since [latex]S_{S}[/latex] is a proportion, it can be converted to Sorensen dissimilarity ([latex]D_{S}[/latex]) by subtraction. [latex]D_{S}[/latex] can also be calculated directly from the contingency table:

[latex]D_{S} = 1 - S_{S} = 1 - \frac{2a}{2a + b + c} = \frac{b + c}{2a + b + c}[/latex]

Sorensen dissimilarity is likewise based on the species that occur in only one of the two samples, but the shared species are counted twice in the denominator.

Refer back to plots 1-3. What is the Sorensen dissimilarity between each pair of plots?

	Plot1	Plot2
Plot2
Plot3

Notice that these data do not satisfy the triangle inequality: the dissimilarity from plot 1 to 3 plus the dissimilarity from plot 3 to plot 2 is less than the dissimilarity from plot 1 to 2. This demonstrates that the Sorensen dissimilarity is a semimetric measure.
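The triangle inequality check can be sketched in a few lines. The helper name sorensen_dis is hypothetical and the plots are entered by hand.

```r
# Sorensen dissimilarity: (b + c) / (2a + b + c)
sorensen_dis <- function(x, y) {
  x <- x > 0
  y <- y > 0
  n_a <- sum(x & y)    # shared species
  n_b <- sum(x & !y)   # only in first sample
  n_c <- sum(!x & y)   # only in second sample
  (n_b + n_c) / (2 * n_a + n_b + n_c)
}

plot1 <- c(1, 1, 1, 0, 0)
plot2 <- c(0, 0, 0, 1, 1)
plot3 <- c(1, 1, 1, 1, 1)

ds_12 <- sorensen_dis(plot1, plot2)
ds_13 <- sorensen_dis(plot1, plot3)
ds_23 <- sorensen_dis(plot2, plot3)

# TRUE here means the indirect route (1 -> 3 -> 2) is shorter than
# the direct route (1 -> 2), which violates the triangle inequality
(ds_13 + ds_23) < ds_12
```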

Bray-Curtis Distance

When the formula for Sorensen dissimilarity is extended from presence/absence data to species abundance data, it results in the Bray-Curtis distance measure:

[latex]D_{i,h} = \frac{\sum_{j=1}^p \mid a_{ij} - a_{hj} \mid} { \sum_{j=1}^p a_{ij} + \sum_{j=1}^p a_{hj} } = 1 - \frac{ 2 \sum_{j=1}^p MIN(a_{ij}, a_{hj}) }{ \sum_{j=1}^p a_{ij} + \sum_{j=1}^p a_{hj} } = 1 - \frac{ 2 \sum_{j=1}^p MIN(a_{ij}, a_{hj}) }{ a_{i\cdot} + a_{h\cdot} }[/latex]

where

[latex]p[/latex] is the total number of species
[latex]a_{ij}[/latex] is the abundance of species j in sample unit i
[latex]a_{hj}[/latex] is the abundance of species j in sample unit h
[latex]a_{i\cdot}[/latex] is the total abundance of all species in sample unit i
[latex]a_{h\cdot}[/latex] is the total abundance of all species in sample unit h

These formulae are from chapter 6 of McCune & Grace (2002). The middle and right-hand versions are the same except that in the right-hand one I used the same terminology in the denominator as in the formula for the chi-square distance below. This is to permit easier comparisons between the two measures. Note that these formulae are based on the data matrix, not on the contingency table that was the basis of the Sorensen dissimilarity.

The Bray-Curtis distance measure is bounded between 0 (the sample units are identical) and 1 (the sample units are completely different), and is semimetric.
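A quick way to convince yourself that the versions of the formula are equivalent is to code them side by side. The sketch below uses plots 4 and 5 (abundances) and plots 1 and 3 (presence/absence), all entered by hand; the object names are my own.

```r
plot4 <- c(SppA = 0, SppB = 4, SppC = 8)
plot5 <- c(SppA = 0, SppB = 1, SppC = 1)

# Left-hand version: summed absolute differences over summed totals
bc_abs <- sum(abs(plot4 - plot5)) / (sum(plot4) + sum(plot5))

# Right-hand version: based on the minimum abundance of each species
bc_min <- 1 - 2 * sum(pmin(plot4, plot5)) / (sum(plot4) + sum(plot5))

all.equal(bc_abs, bc_min)

# Applied to presence/absence data, the same formula gives the
# Sorensen dissimilarity, (b + c) / (2a + b + c)
plot1 <- c(1, 1, 1, 0, 0)
plot3 <- c(1, 1, 1, 1, 1)
bc_pa <- sum(abs(plot1 - plot3)) / (sum(plot1) + sum(plot3))
bc_pa
```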

The Bray-Curtis distance measure is named after the co-authors of the paper in which it was used (Bray & Curtis 1957). However, and confusingly, it is also known by many other names: Steinhaus, Czekanowski, Sorensen, and percentage difference. Often this is because the same measure was proposed independently or because two measures were proposed that were later shown to be mathematically equivalent.

Several studies (notably, Faith et al. 1987) have concluded that the Bray-Curtis distance measure functions best for community data (e.g., a plot x species matrix). We will see it throughout this course.

Borcard et al. (2018) note that the Bray-Curtis distance “gives the same importance to absolute differences in abundance irrespective of the order of magnitude of the abundances … a difference of 5 individuals has the same weight when the abundances are 3 and 8 as when the abundances are 6203 and 6208” (p.39). If this is problematic, the data can be log-transformed before computing distances.
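To make this concrete, the sketch below uses two hypothetical sample units containing the abundances from the quotation and compares each species' contribution to the numerator of the Bray-Curtis formula before and after a log transformation. I have used log1p(), i.e., log(x + 1), as one possible choice; other log transformations could be substituted.

```r
# Two hypothetical sample units with one rare and one common species
su_i <- c(rare = 3, common = 6203)
su_h <- c(rare = 8, common = 6208)

# Contribution of each species to the Bray-Curtis numerator
raw_contrib <- abs(su_i - su_h)

# The same contributions after a log(x + 1) transformation
log_contrib <- abs(log1p(su_i) - log1p(su_h))

raw_contrib   # identical for the two species
log_contrib   # the rare species now contributes far more
```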

Note: Recent work has decomposed or partitioned Bray-Curtis distances into two components, one related to ‘balanced variation in abundance’ and the other to ‘abundance gradients’ (Baselga 2013). These components are analogous to the turnover and nestedness components of Jaccard dissimilarities. See the description of the betapart package below for more information.

Chi-Square Distance

The Chi-square distance measure is the basis of an ordination technique known as correspondence analysis, which, with its variants, is popular in some quarters. Simulation tests have found that chi-square distances do not perform well with community data (Faith et al. 1987), but the popularity of correspondence analysis means that it is helpful to be familiar with this measure.

The formula for chi-square distance is:

[latex]D_{i,h} = \sqrt {\sum_{j=1}^p \frac{1}{a_{\cdot j}} [ \frac{a_{hj}}{a_{h\cdot}} - \frac{a_{ij}}{a_{i\cdot}} ]^2}[/latex]

where

[latex]p[/latex] is the total number of species
[latex]a_{ij}[/latex] is the abundance of species j in sample unit i
[latex]a_{hj}[/latex] is the abundance of species j in sample unit h
[latex]a_{i\cdot}[/latex] is the total abundance of all species in sample unit i
[latex]a_{h\cdot}[/latex] is the total abundance of all species in sample unit h
[latex]a_{\cdot j}[/latex] is the total abundance of species j across all sample units

This formula is from chapter 6 of McCune & Grace (2002).

Like Euclidean distances, this measure involves summing squared differences. However, the chi-square distance measure also:

Expresses the abundance of each species ([latex]a_{hj}[/latex] and [latex]a_{ij}[/latex]) as a proportion of the total abundance on the sample unit ([latex]a_{h\cdot}[/latex] and [latex]a_{i\cdot}[/latex]). In other words, relativization by row total is built into this distance measure.
Weights the squared difference by the inverse of the total abundance of the species ([latex]a_{\cdot j}[/latex]). This is also a built-in relativization, but it’s not one of the common ones that we’ve discussed previously (e.g., relativization by column maximum). This is relativization by column total.

This last aspect – weighting by the total abundance of the species – is where some of the problems arise. It means, for example, that:

The distance between two sample units depends on which other sample units are included in the data matrix.
Since this is an inverse weighting, common species are downplayed and rare species are weighted more strongly.
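The first of these points can be demonstrated with a small sketch that implements the formula exactly as written above. The function name chisq_dist and the third sample unit "X" are hypothetical, and other implementations may rescale the result (for example, by a constant based on the grand total), so absolute values need not match those from other functions.

```r
# Chi-square distance following the formula in the text
chisq_dist <- function(x) {
  x <- as.matrix(x)
  x <- x[, colSums(x) > 0, drop = FALSE]  # drop species absent everywhere
  row_prop <- x / rowSums(x)              # relativize by row total
  col_tot <- colSums(x)                   # total abundance per species
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (h in seq_len(n)) {
      d[i, h] <- sqrt(sum((row_prop[h, ] - row_prop[i, ])^2 / col_tot))
    }
  }
  as.dist(d)
}

# Plots 4 and 5 together with plot 6 ...
with_plot6 <- rbind("4" = c(0, 4, 8), "5" = c(0, 1, 1), "6" = c(1, 0, 0))
# ... or together with a different, made-up sample unit
with_other <- rbind("4" = c(0, 4, 8), "5" = c(0, 1, 1), "X" = c(0, 10, 0))

d45_a <- as.matrix(chisq_dist(with_plot6))["4", "5"]
d45_b <- as.matrix(chisq_dist(with_other))["4", "5"]

c(with_plot6 = d45_a, with_other = d45_b)
```

Plots 4 and 5 have not changed, but the distance between them has, because the species totals used as weights depend on the third sample unit.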

UniFrac Distance

In the above distance measures, each species is considered individually and the differences for each species are summed to determine the total distance between sample units. Another approach is to account for the degree of relatedness among species. For example, consider these three plots (for simplicity, I only show three species):

Plot	Poa pratensis	Poa compressa	Hypochaeris radicata
7	5	0	0
8	0	5	0
9	0	0	5

By any of the above distance measures, the magnitude of the difference is the same between any two plots. However, the species in plots 7 and 8 are from the same genus and thus arguably these plots are more similar to one another than to plot 9.

UniFrac incorporates phylogenetic information about the taxa in the distance between sample units. The taxa present in two sample units are placed on a phylogenetic tree. Every branch in this tree ends at a taxon, and the length of the branch is a measure of how related that taxon is to the next one to which it connects. A branch is coded as ‘shared’ if the taxon is present in both sample units and as ‘unshared’ if the taxon is only present in one sample unit. I haven’t shown the distance formula here, but it is simply the sum of the unshared branch lengths as a proportion of the total length of all branches in the tree. This is a metric measure with values ranging from 0 to 1.
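The logic can be sketched with a toy example for plots 7-9. Everything here is an assumption made for illustration: the tree is hand-coded rather than built from sequence data, the branch lengths are made up, and the function name toy_unifrac is hypothetical. I have treated an internal branch as shared when it leads to taxa found in both sample units. Real analyses would rely on a dedicated phylogenetics package.

```r
# A hand-coded toy tree: one row per branch, with made-up lengths
branches <- data.frame(
  branch = c("P. pratensis tip", "P. compressa tip",
             "Poa stem", "H. radicata tip"),
  length = c(1, 1, 4, 5)
)
# Taxa descending from each branch (same order as 'branches')
descendants <- list(
  "Poa pratensis",
  "Poa compressa",
  c("Poa pratensis", "Poa compressa"),
  "Hypochaeris radicata"
)

# Unweighted UniFrac-style distance between two sets of taxa
toy_unifrac <- function(taxa_x, taxa_y) {
  in_x <- vapply(descendants, function(d) any(d %in% taxa_x), logical(1))
  in_y <- vapply(descendants, function(d) any(d %in% taxa_y), logical(1))
  unshared <- xor(in_x, in_y)   # branch leads to only one sample unit
  either <- in_x | in_y         # branch leads to at least one sample unit
  sum(branches$length[unshared]) / sum(branches$length[either])
}

uf_78 <- toy_unifrac("Poa pratensis", "Poa compressa")
uf_79 <- toy_unifrac("Poa pratensis", "Hypochaeris radicata")

c(plots_7_8 = uf_78, plots_7_9 = uf_79)
```

With these made-up branch lengths, plots 7 and 8 are closer to one another than either is to plot 9 because the two Poa species share the stem branch.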

The original UniFrac measure (Lozupone & Knight 2005) is unweighted, meaning that it gives equal weight to each taxon (analogous to Jaccard and Sorensen dissimilarities). There is also a weighted UniFrac which accounts for the relative abundances of taxa (analogous to Bray-Curtis distance) as well as a generalized UniFrac that combines the weighted and unweighted approaches in a single framework (Chen et al. 2012). More recently, this approach has been adapted for use in paired and longitudinal designs as are commonly used in microbiome studies (Plantinga et al. 2019).

This distance measure is not as easy to apply as the others as it requires phylogenetic information in addition to the standard composition matrix (sample units x species). It obviously is only relevant for compositional data.

There are a large number of other phylogenetic distances available. Cadotte & Davies (2016) provide examples of how to build phylogenetic trees and use them to calculate a wide range of metrics (see their Table 3.3).

Gower’s Distance

Most distance measures assume that the underlying data are continuously distributed, but Gower (1971) proposed a generalized coefficient of dissimilarity that can be applied to a dataset consisting of a variety of data types: continuously distributed, nominal, and/or ordinal variables. Greenacre & Primicerio (2013) provide a nice description of this approach. Gower’s distance is available in several of the functions summarized below.

Mahalanobis Distance

Mahalanobis (1936) proposed a way to compare a sample to sets of other samples and determine how likely it is that the sample belongs to each set. The formula includes the covariance matrix to account for differences in variability among variables and for correlations between them. It is intended for multivariate normal data.

One way the Mahalanobis distance can be used is to evaluate whether individual sample units are outliers relative to the rest of the data – see the ‘Multivariate Outlier Analysis’ chapter for more information.
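As a rough illustration of what the covariance matrix adds, the sketch below simulates two correlated variables and compares two new points that are about equally far from the centroid in Euclidean terms. The data, seed, and object names are all made up; mahalanobis() is part of the stats package and returns squared distances.

```r
set.seed(42)  # arbitrary seed so the simulated data are reproducible
x <- rnorm(50)
sim <- cbind(x = x, y = 0.9 * x + rnorm(50, sd = 0.3))  # strongly correlated

# Two new points: one along the main trend, one across it
new_pts <- rbind(along = c(2, 2), across = c(2, -2))

# Squared Mahalanobis distances from the centroid of the simulated data
md2 <- mahalanobis(new_pts, center = colMeans(sim), cov = cov(sim))
md <- sqrt(md2)

# Euclidean distances from the same centroid, for comparison
ed <- sqrt(rowSums(sweep(new_pts, 2, colMeans(sim))^2))

rbind(mahalanobis = md, euclidean = ed)
```

The point lying across the trend is much farther away in Mahalanobis terms, which is why this measure is useful for flagging multivariate outliers.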

Two Issues to Consider

Should Shared Absences Matter?

When summarizing species data, ecologists generally prefer distance measures that are not affected by species that are absent from the two experimental units being compared (cell d in the contingency table shown in the description of Jaccard similarity above). Species could be absent for any number of reasons, and it generally doesn’t make sense to determine the similarity between two samples based on species that are present in neither.

However, some research has shown that species absences can be informative when dealing with datasets that contain high beta diversity (species turnover among plots). For more information about ‘extended dissimilarities’, see De’ath (1999) and Boyce & Ellison (2001), the discussion of the dsvdis() function below, and the help files for the stepacross() function in vegan. Note that using a function such as stepacross() will change how distances are calculated such that they no longer have a fixed upper bound. Anderson et al. (2011) provide an extended discussion about beta diversity.

Dealing With Empty Sample Units

We’ve mentioned before that most distance measures assume that at least one species is present in every plot. This can be problematic if you are dealing with a community in which organisms are heterogeneously distributed – their absence from a given area may be very important information! For example, consider these data:

Plot	SppA	SppB	SppC
4	0	4	8
5	0	1	1
6	1	0	0
10	0	0	0

These are plots 4-6 from above, together with plot 10 which is empty – there were no species in this sample unit. There are two common ways to deal with this.

First, you could delete the empty sample units, as we discussed earlier (see ‘Data Adjustment’ chapter). However, this changes the questions being asked – for example, from “how does composition differ among all plots?” to “how does composition differ among vegetated plots?”.

Second, you could use a ‘zero-adjusted’ distance measure (Clarke et al. 2006). This simply involves adding a ‘pseudo-species’ with the same small cover value to all plots. Often, the cover value used is the minimum value that a species would be assigned if present on a plot. This pseudo-species is added to all plots – not just those that are empty – so that it alters all distances in the same manner. This approach can be applied to data subject to any distance measure. You would want to exclude pseudo-species when calculating metrics such as species richness.

Using one of the functions described below, verify that you cannot calculate the Bray-Curtis distance between plot 10 and any of the other plots. Then, add a ‘dummy’ species with an abundance of 1 in all plots and re-calculate the distance matrix.

Some code that may be of interest:
eg <- data.frame(
  Plot = c("p4", "p5", "p6", "p10"),
  SppA = c(0, 0, 1, 0),
  SppB = c(4, 1, 0, 0),
  SppC = c(8, 1, 0, 0)
)
Spp <- c("SppA", "SppB", "SppC")
# Bray-Curtis distances among plots 4-6 only (the empty plot 10 is excluded)
vegan::vegdist(eg[1:3, colnames(eg) %in% Spp])
# Add a dummy species to every plot, then include all four plots
eg$SppDummy <- 1
vegan::vegdist(eg[, colnames(eg) %in% c(Spp, "SppDummy")])

R Functions to Calculate Distance Measures

The above distance measures, and others, can be calculated using many R functions, including:

dist() in stats
vegdist() and betadiver() in vegan (this package also includes the designdist() function for constructing your own distance measure)
dsvdis() in labdsv
distance() and bcdist() in ecodist
gdist() and xdiss() in mvpart
daisy() in cluster
gowdis() in FD
beta.pair() and bray.part() in betapart

We’ll survey a few of these functions here. Don’t forget to load these packages before calling their functions! Details about the measures calculated by each function can be obtained through the R help files.

Keep in mind that many distance measures are referred to by several equivalent names.

`stats::dist()`

The dist() function is part of the stats package which is part of the base R installation and thus is automatically loaded. It can calculate 6 different distance measures. Its usage is:
dist(x, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)

The key arguments are:

x – the data matrix, data frame, or distance matrix to be analyzed
method – the distance measure to be used. There are 6 distance measures available:
- euclidean – the default
- maximum – the largest absolute difference among all pairwise elements.
- manhattan – aka City-block. Calculated as the sum of the absolute differences along all dimensions.
- canberra – sum across all pairwise elements of the absolute value of the difference divided by the sum of the absolute values.
- binary – proportion of pairwise elements in which only one of the two is non-zero, among those in which at least one is non-zero. Pairs where both are zero are ignored. For presence/absence data, this is the Jaccard dissimilarity.
- minkowski – generalized form of Euclidean and Manhattan distances (see p. 47 of McCune & Grace (2002) for equation). Requires an additional argument (p) specifying the power, and the corresponding root, to be applied; the default is p = 2.

diag – print the diagonal of the distance matrix? Default is FALSE (no).
upper – print the upper-triangle of the matrix? Default is FALSE (no).
p – power; used only when calculating the Minkowski distance.
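To see how these options relate to one another, here is a short sketch using plots 4-6 entered by hand (the object names are my own). It checks that the Minkowski distance reduces to the Manhattan distance when p = 1 and to the Euclidean distance when p = 2.

```r
plots456 <- matrix(
  c(0, 4, 8,
    0, 1, 1,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("4", "5", "6"), c("SppA", "SppB", "SppC"))
)

d_man   <- dist(plots456, method = "manhattan")
d_euc   <- dist(plots456, method = "euclidean")
d_max   <- dist(plots456, method = "maximum")
d_mink1 <- dist(plots456, method = "minkowski", p = 1)
d_mink2 <- dist(plots456, method = "minkowski", p = 2)

# Minkowski with p = 1 is Manhattan; with p = 2 it is Euclidean
all.equal(as.vector(d_mink1), as.vector(d_man))
all.equal(as.vector(d_mink2), as.vector(d_euc))

# Printing options: show the diagonal and the upper triangle
print(d_max, diag = TRUE, upper = TRUE)
```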

`vegan::vegdist()`

The vegdist() function in vegan currently includes 21 distance measures. Its usage is:
vegdist(x, method = "bray", binary = FALSE, diag = FALSE, upper = FALSE, na.rm = FALSE, ...)

The key arguments are:

x – the data matrix to be analyzed, with plots as rows and variables as columns
method – type of distance measure to use. Notes about a few of the distance measures follow; the others can be found in the R help files.

- manhattan – identical to result from dist()
- euclidean – identical to result from dist(), but not the default measure.
- canberra – may yield different results than from dist()
- clark
- bray – Bray-Curtis. The default measure for this function.
- kulczynski
- jaccard – computed as 2B/(1+B), where B is the Bray-Curtis dissimilarity. The help file indicates that Bray-Curtis and Jaccard dissimilarities are rank-order similar and suggests that the Jaccard “probably should be preferred” because it is metric whereas Bray-Curtis is semimetric. However, I have not seen any formal evaluation of how sensitive conclusions are to this choice.
- gower – the help file indicates that this version of Gower’s distance cannot handle mixed data (e.g., continuous variables and factors simultaneously). I have not tested this.
- altGower
- morisita
- horn
- mountford
- raup
- binomial
- chao – a measure that accounts for unseen species pairs (i.e., differences in sample size that particularly affect rare species). See Chao et al. (2005) for details.
- cao
- mahalanobis
- chisq – Chi-square distances, as used in correspondence analysis.
- chord
- aitchison
- robust.aitchison
binary – whether to use decostand() to convert data to presence/absence before calculating distances. Default is FALSE (no).
diag – whether to return the diagonal of the distance matrix. Default is FALSE (no).
upper – whether to return the values in the upper triangle of the distance matrix. Default is FALSE (no).
na.rm – whether to delete missing observations (pairwise) when calculating dissimilarities. Default is FALSE (no).

`labdsv::dsvdis()`

The dsvdis() function in labdsv currently offers 7 distance measures. Its usage is:

dsvdis(x, index, weight = rep(1, ncol(x)), step = 0.0, diag = FALSE, upper = FALSE)

The key arguments are:

x – the data matrix to be analyzed, with plots as rows and variables as columns
index – the distance measure to be used. Note that this argument has a different name than the comparable argument in the other functions (‘index’ vs ‘method’) and does not have a default distance measure. The available distance measures are:

- steinhaus – aka Jaccard. Based on presence/absence (converts abundances automatically).
- sorensen – based on presence/absence (converts abundances automatically).
- ochiai – based on presence/absence (converts abundances automatically).
- ruzicka
- bray/curtis – Bray-Curtis. Note the slash in the name here.
- roberts
- chisq

weight – an opportunity to weight species differently during the calculation of distances. The default (rep(1,ncol(x))) assigns all species the same value (1).
step – a threshold dissimilarity to initiate shortest-path adjustment. This is likely to be most useful in datasets where there is complete species turnover (e.g., spanning very large gradients). The default (0.0) means that no adjustments are made. If a positive value is specified, any distances above that value are replaced by the distance of the shortest connected path between points less than the threshold apart. The result is that the dissimilarity values are no longer bounded between 0 and 1. To my knowledge, the consequences of these types of adjustments for ecological interpretation have not been rigorously assessed.
diag – whether to return data for the diagonal. Default is FALSE (no).
upper – whether to return data in the upper triangle of the distance matrix (TRUE) or the lower triangle (FALSE). Default is FALSE (no).

`ecodist::distance()`

For an introduction to the ecodist package, see Goslee & Urban (2007). In ecodist, the distance() function can calculate 10 different distance measures. Its usage is:

distance(x, method = "euclidean", sprange = NULL, spweight = NULL, icov, inverted = FALSE)

The key arguments are:

x – the data matrix or data frame to be analyzed, with plots as rows and variables as columns
method – the distance measure to be used. There are 10 distance measures available:

- euclidean – the default method for this function.
- bray-curtis – note the dash in the name here. Can also be calculated directly using the bcdist() function in this package.
- manhattan
- mahalanobis
- jaccard
- difference
- sorensen
- gower – Gower’s distance.
- modgower10 – variation of Gower’s distance measure, using base 10.
- modgower2 – variation of Gower’s distance measure, using base 2.

sprange and spweight permit species data to be relativized during the distance calculation. These only apply to a subset of the available distance measures, particularly Gower’s distance and variations on it.
icov and inverted are used when calculating Mahalanobis distances.

This package also includes bcdist(), a function that returns the Bray-Curtis distance measure. This is faster than distance(x, method = "bray-curtis"), and includes an option to drop empty rows or to set distances between empty rows to zero.

`betapart`

The betapart package provides functions to partition dissimilarity measures into their components.

Incidence-based metrics can be partitioned into the variation associated with turnover and with nestedness. The beta.pair() function calculates three distance matrices, one for turnover, one for nestedness, and one for total dissimilarity. It can be used to calculate either Jaccard or Sorensen dissimilarities.

The components of abundance-based metrics are similar to those of the incidence-based metrics but a bit nuanced: balanced variation in abundance instead of turnover, and abundance gradients instead of nestedness. The bray.part() function calculates three distance matrices, one for balanced variation, one for abundance gradients, and one for total Bray-Curtis dissimilarity.

The components of dissimilarity are additive – the sum of the first two distance matrices calculated using either method above is equal to the third distance matrix (total dissimilarity). The components can be analyzed separately but, since they are aspects of the same value, I often relativize them and consider the proportion of the total dissimilarity that is attributable to one of the two components.
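The additivity of the components can be illustrated without the package. The sketch below hand-codes the Jaccard partition for plots 1-3 using the turnover formula as I understand it from Baselga (2012), 2 min(b, c) / (a + 2 min(b, c)), with nestedness taken as the remainder. The function name jaccard_partition is hypothetical; treat this as a check on intuition and use beta.pair() for real analyses.

```r
# Partition Jaccard dissimilarity into turnover and nestedness
# (turnover formula assumed from Baselga 2012; not given in the text)
jaccard_partition <- function(x, y) {
  x <- x > 0
  y <- y > 0
  n_a <- sum(x & y)    # shared species
  n_b <- sum(x & !y)   # only in first sample
  n_c <- sum(!x & y)   # only in second sample
  total <- (n_b + n_c) / (n_a + n_b + n_c)
  turnover <- 2 * min(n_b, n_c) / (n_a + 2 * min(n_b, n_c))
  c(turnover = turnover, nestedness = total - turnover, total = total)
}

plot1 <- c(1, 1, 1, 0, 0)
plot2 <- c(0, 0, 0, 1, 1)
plot3 <- c(1, 1, 1, 1, 1)

part_12 <- jaccard_partition(plot1, plot2)  # no shared species
part_13 <- jaccard_partition(plot1, plot3)  # plot 1 is a subset of plot 3

rbind(plots_1_2 = part_12, plots_1_3 = part_13)

# Proportion of total dissimilarity attributable to nestedness
part_13[["nestedness"]] / part_13[["total"]]
```

Plots 1 and 2 differ entirely through turnover, whereas plots 1 and 3 differ entirely through nestedness, even though both pairs have a non-zero total dissimilarity.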

Examples

Today’s Example Dataset

The dataset of plots 1-3 used to introduce the Euclidean distance measure is available as a text file (Legendre.Legendre.2012.p311.txt) in the course GitHub folder. Save it in the ‘data’ subfolder within your course folder. Then, open your R project and load the data into R:
test <- read.table("data/Legendre.Legendre.2012.p311.txt", header = TRUE)

To see the Euclidean distances among plots in test:
(ED.test <- dist(test))

Note that the result is displayed as a lower triangular matrix. What class is this object? It can easily be converted to a full matrix:
as.matrix(ED.test)

Oak Plant Community Dataset

Now, let’s calculate a distance matrix using the geographic data associated with our Oak plant community dataset. Assuming you have already saved the files to your data folder, we begin by loading them:
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE, row.names = 1)
Oak_species <- read.csv("data/Oak_species_189x5.csv", header = TRUE)

Create an object containing the response data:
Oak_abund <- Oak[ , colnames(Oak) %in% Oak_species$SpeciesCode]

As we’ve already seen, this dataset includes a number of potential explanatory variables along with the abundances of a large number of species (see the ‘Oak_Metadata.docx’ file for details). We will begin by focusing on the latitude and longitude of each stand. We will extract these variables and then use them to calculate a distance matrix. The Euclidean distance measure is a reasonable choice since these data are spatial coordinates.
library(vegan)
geog.dis <- Oak[,c("LatAppx","LongAppx")] |> vegdist(method = "euc")

The resulting object contains 1081 distances:
length(geog.dis)

Why this many distances? Recall from the last chapter the formula for the number of pairwise distances among n sample units.
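The count can be confirmed with a couple of lines. The sketch below uses n = 47 (the number of stands) and, as a further check, builds a distance matrix from 47 rows of random numbers standing in for the real coordinates.

```r
n <- 47                 # number of sample units (stands)

# Number of pairwise distances among n sample units
n_pairs <- n * (n - 1) / 2
n_pairs
choose(n, 2)            # same thing: number of ways to choose 2 of n

# A distance matrix on any 47 rows has this many elements,
# regardless of the number of columns
fake_xy <- matrix(rnorm(n * 2), ncol = 2)      # stand-in data, 2 variables
fake_spp <- matrix(runif(n * 189), ncol = 189) # stand-in data, 189 variables
length(dist(fake_xy))
length(dist(fake_spp))
```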

Note that the number of variables has no effect on the size of the resulting distance matrix. We will illustrate this by also calculating a distance matrix based on the species abundance data. Before doing so, let’s think about relativizations again. In this dataset, trees were measured in very different units than other growth forms so it makes sense to relativize each species by its maximum – this will put the data for all species on the same scale. (If appropriate for our objectives, we could also have made other adjustments, such as deleting rare species and relativizing by row totals. For simplicity, we did not do so here.) We can do this within our piped functions:
spp.dis <- Oak_abund |>
decostand(method = "max") |>
vegdist()

I didn’t specify a method for vegdist(). Why not?

Use the str() function to view the details of geog.dis and spp.dis. Verify that they are the same size, even though geog.dis was based on two variables and spp.dis was based on 189 variables.

Distance matrices like these are what we are going to utilize throughout the rest of the course.

Conclusions

The behavior of a distance measure is highly influenced by the data adjustments (deletions, transformations, standardizations/relativizations) applied prior to its calculation. It bears repeating that the appropriateness of these various adjustments must be determined based on the questions you are addressing.

A single distance matrix always contains values calculated using the same distance measure – we would never combine values obtained using different distance measures in one distance matrix. However, we will see examples of how matrices derived from different data on the same sample units (as with geog.dis and spp.dis above) can be compared.

There are many distance measures to choose from. While this can be confusing, it also gives flexibility. Distance measures have different properties and/or make different assumptions about the data. These properties and assumptions are important to keep in mind when choosing which distance measure to use. Some analytical techniques implicitly assume a distance measure:

Correspondence Analysis (CA) and Canonical Correspondence Analysis (CCA) are based on chi-square distances
Principal Component Analysis (PCA) is based on Euclidean distances

In instances like these, any limitations of the assumed distance measure are carried over to the associated technique. For example, if it is inappropriate to summarize a dataset using Euclidean distances, it would also be inappropriate to analyze that dataset using PCA. Conversely, if Euclidean distances are appropriate to apply to a dataset, then PCA may be appropriate … depending on other considerations such as the degree of correlation. See the ‘PCA’ chapter for additional information.
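The link between PCA and Euclidean distance can be checked directly. The sketch below uses simulated data and base R's `prcomp()` without scaling; the object names are hypothetical. Under these assumptions, the Euclidean distances among sample units calculated from the full set of principal component scores should match those calculated from the original variables.

```r
set.seed(42)
# Simulated data: 10 sample units, 4 variables (not a real dataset)
X <- matrix(rnorm(10 * 4), nrow = 10, ncol = 4)

# PCA on centered, unscaled data
pca <- prcomp(X, center = TRUE, scale. = FALSE)

d_raw <- dist(X)      # Euclidean distances among the original data
d_pca <- dist(pca$x)  # Euclidean distances among all PC scores

# Compare the distance values themselves (ignoring attributes)
same_distances <- isTRUE(all.equal(as.vector(d_raw), as.vector(d_pca)))
same_distances

# Using only the first two components gives approximate distances
# that can be no larger than the full Euclidean distances
d_pca2 <- dist(pca$x[, 1:2])
never_larger <- all(as.vector(d_pca2) <= as.vector(d_raw) + 1e-12)
never_larger
```

PCA rotates the centered data without stretching it, so it preserves Euclidean distances and no others. If the variables were standardized first (`scale. = TRUE`), the scores would instead preserve Euclidean distances among the standardized variables.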

The fact that distance measures have different properties can be an asset. For example, a compositional matrix could be summarized with a presence/absence-based measure and with an abundance-based measure. Rare species are given more weight in a presence/absence measure whereas common species are given more weight in an abundance-based measure. Therefore, if these two distance matrices were analyzed identically, differences between the resulting analyses would reflect whether patterns in the community are being driven by the rare or the common species (Lozupone et al. 2007). In a recent study (Bakker et al. 2023), we compared abundance- and incidence-based measures of grassland compositional variation. We found that grasslands differed greatly in the importance of these two measures, and that these measures were correlated with different aspects of the environmental conditions at a site.
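A toy example may make this contrast concrete. The sketch below uses an invented three-unit community matrix, not data from any study. The presence/absence measure is Jaccard dissimilarity via `stats::dist(method = "binary")`, and the abundance-based measure is Bray-Curtis calculated by `bray_curtis()`, a hypothetical helper written here so that the example runs in base R (in practice, `vegan::vegdist()` would normally be used).

```r
# Invented community matrix: 3 sample units x 3 species
comm <- rbind(A = c(sp1 = 100, sp2 = 1, sp3 = 0),
              B = c(sp1 = 100, sp2 = 0, sp3 = 1),
              C = c(sp1 = 10,  sp2 = 1, sp3 = 0))

# Presence/absence: dist(method = "binary") treats non-zero values as
# presences and returns Jaccard dissimilarity
d_incidence <- dist(comm, method = "binary")

# Hypothetical helper: Bray-Curtis dissimilarity for all pairs of rows
bray_curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}
d_abundance <- bray_curtis(comm)

inc <- as.matrix(d_incidence)
abu <- as.matrix(d_abundance)

inc["A", "C"]  # A and C contain the same species
inc["A", "B"]  # A and B each have one rare species the other lacks
abu["A", "B"]  # A and B share a similarly abundant dominant species
abu["A", "C"]  # the dominant species differs greatly in abundance
```

In this example the incidence-based measure regards A and C as identical because they contain the same species, while the abundance-based measure regards A and B as most similar because the rare species contribute little relative to the dominant one.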

References

Anderson, M.J., T.O. Crist, J.M. Chase, M. Vellend, B.D. Inouye, A.L. Freestone, N.J. Sanders, H.V. Cornell, L.S. Comita, K.F. Davies, S.P. Harrison, N.J.B. Kraft, J.C. Stegen, and N.G. Swenson. 2011. Navigating the multiple meanings of β diversity: a roadmap for the practicing ecologist. Ecology Letters 14:19-28.

Bakker, J.D., J.N. Price, J.A. Henning, E.E. Batzer, T.J. Ohlert, C.E. Wainwright, P.B. Adler, J. Alberti, C.A. Arnillas, L.A. Biederman, E.T. Borer, L.A. Brudvig, Y.M. Buckley, M.N. Bugalho, M.W. Cadotte, M.C. Caldeira, J.A. Catford, Q. Chen, M.J. Crawley, P. Daleo, C.R. Dickman, I. Donohue, M.E. DuPre, A. Ebeling, N. Eisenhauer, P.A. Fay, D.S. Gruner, S. Haider, Y. Hautier, A. Jentsch, K. Kirkman, J.M.H. Knops, L.S. Lannes, A.S. MacDougall, R.L. McCulley, R.M. Mitchell, J.L. Moore, J.W. Morgan, B. Mortensen, H. Olde Venterink, P.L. Peri, S.A. Power, S.M. Prober, C. Roscher, M. Sankaran, E.W. Seabloom, M.D. Smith, C. Stevens, L.L. Sullivan, M. Tedder, G.F. Veen, R. Virtanen, and G.M. Wardle. 2023. Compositional variation in grassland plant communities. Ecosphere 14(6):e4542. https://doi.org/10.1002/ecs2.4542.

Baselga, A. 2010. Partitioning the turnover and nestedness components of beta diversity. Global Ecology and Biogeography 19:134-143.

Baselga, A. 2012. The relationship between species replacement, dissimilarity derived from nestedness, and nestedness. Global Ecology and Biogeography 21:1223-1232.

Baselga, A. 2013. Separating the two components of abundance-based dissimilarity: balanced changes in abundance vs. abundance gradients. Methods in Ecology and Evolution 4:552-557.

Borcard, D., F. Gillet, and P. Legendre. 2018. Numerical ecology with R. 2nd edition. Springer, New York, NY.

Boyce, R.L., and P.C. Ellison. 2001. Choosing the best similarity index when performing fuzzy set ordination on binary data. Journal of Vegetation Science 12:711-720.

Bray, J.R., and J.T. Curtis. 1957. An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs 27:325-349.

Cadotte, M.W., and T.J. Davies. 2016. Phylogenies in ecology: a guide to concepts and methods. Princeton University Press, Princeton, NJ.

Chao, A., R.L. Chazdon, R.K. Colwell, and T-J. Shen. 2005. A new statistical approach for assessing similarity of species composition with incidence and abundance data. Ecology Letters 8:148-159.

Chen, J., K. Bittinger, E.S. Charlson, C. Hoffmann, J. Lewis, G.D. Wu, R.G. Collman, F.D. Bushman, and H. Li. 2012. Associating microbiome composition with environmental covariates using generalized UniFrac distances. Bioinformatics 28(16):2106-2113. doi:10.1093/bioinformatics/bts342

Clarke, K.R., P.J. Somerfield, and M.G. Chapman. 2006. On resemblance measures for ecological studies, including taxonomic dissimilarities and a zero-adjusted Bray-Curtis coefficient for denuded assemblages. Journal of Experimental Marine Biology and Ecology 330:55-80.

De’ath, G. 1999. Extended dissimilarity: a method of robust estimation of ecological distances from high beta diversity data. Plant Ecology 144:191-199.

Faith, D.P., P.R. Minchin, and L. Belbin. 1987. Compositional dissimilarity as a robust measure of ecological distance. Vegetatio 69:57-68.

Goslee, S.C., and D.L. Urban. 2007. The ecodist package for dissimilarity-based analysis of ecological data. Journal of Statistical Software 22(7):1-19.

Gower, J.C. 1971. A general coefficient of similarity and some of its properties. Biometrics 27:857-874.

Greenacre, M., and R. Primicerio. 2013. Measures of distance between samples: Non-Euclidean. Ch. 5 in Multivariate analysis of ecological data. Fundación BBVA. http://www.multivariatestatistics.org/publications.html

Jaccard, P. 1912. The distribution of the flora in the alpine zone. The New Phytologist 11:37-50.

Legendre, P., and M. De Cáceres. 2013. Beta diversity as the variance of community data: dissimilarity coefficients and partitioning. Ecology Letters 16:951-963.

Legendre, P., and L. Legendre. 2012. Numerical ecology. 3rd English Edition. Elsevier, Amsterdam, The Netherlands.

Lozupone, C., and R. Knight. 2005. UniFrac: a new phylogenetic method for comparing microbial communities. Applied and Environmental Microbiology 71(12):8228-8235. doi:10.1128/AEM.71.12.8228-8235.2005

Lozupone, C., M. Hamady, S.T. Kelley, and R. Knight. 2007. Quantitative and qualitative β diversity measures lead to different insights into factors that structure microbial communities. Applied and Environmental Microbiology 73(5):1576-1585. doi:10.1128/AEM.01996-06

Mahalanobis, P.C. 1936. On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2(1):49-55.

McCune, B., and J.B. Grace. 2002. Analysis of ecological communities. MjM Software Design, Gleneden Beach, OR.

Plantinga, A.M., J. Chen, R.R. Jenq, and M.C. Wu. 2019. pldist: ecological dissimilarities for paired and longitudinal microbiome association analysis. Bioinformatics 35(19):3567-3575. doi:10.1093/bioinformatics/btz120

Sørensen, T.J. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Kongelige Danske videnskabernes selskab 5(4):1-34. [need to verify citation]

License

Applied Multivariate Statistics in R Copyright © 2024 by Jonathan D. Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
