Group Comparisons
20 GDM
Learning Objectives
To introduce the theory behind generalized dissimilarity modelling (GDM).
To illustrate how GDM can describe and predict compositional turnover.
Resources (recommended)
Mokany et al. (2022)
Key Packages
require(gdm)
Introduction
A distance or dissimilarity matrix contains information about all possible pairs of sample units. The amount of information increases rapidly with the number of sample units. When based upon compositional data, the dissimilarity between sample units is a measure of beta diversity (Anderson et al. 2011). Generalized dissimilarity modelling (GDM) was developed to understand turnover in community composition (i.e., beta diversity) across large regions (Ferrier et al. 2007; Mokany et al. 2022).
GDM, like the Mantel test, examines the full set of distances among all pairs of sample units. For n sample units, there are n(n - 1) / 2 distances. The set of distances is analyzed as a univariate response. As with Mantel tests, statistical significance is assessed via permutations of the sample units.
GDM relates this set of distances to the distances among those same pairs of sample units with regard to one or more predictor variables. It is intended for use when:
- Sample units are spatially organized, and those coordinates are available
- Sample units span large ecological gradients
- Studies are observational (vs. experimental)
- Predictors are continuously distributed. Ordinal categorical predictors can also be used, but nominal categorical variables cannot.
- Response variables are appropriately summarized in a dissimilarity matrix (bounded {0, 1})
Theory
Link Function
Dissimilarity measures such as Bray-Curtis are bounded {0, 1}. This means that it would be inappropriate to predict values below 0 or above 1. GDM is an extension of the Mantel test, which examines the correlation between two distance matrices, while recognizing these bounds. The predicted values from GDM cannot exceed the bounds {0, 1}.
Recall that a Mantel test examines the linear relationship between two distance matrices based on the same sample units. A fit that recognizes the bounds of a dissimilarity measure has to curve, or be non-linear. However, non-linear patterns can be treated as linear patterns if transformed appropriately. This is what occurs with generalized linear models - we specify the appropriate 'link function' based on the distributional form of the response that we are analyzing. GDM uses a negative exponential link function. This link function is chosen because it assumes that dissimilarities are bounded {0, 1} but monotonically increasing - as the distance between two sample units increases, the dissimilarity between them can only stay the same or increase.
[latex]d_{ij} = 1 - e^{-\eta}[/latex]
The link parameter, eta ([latex]\eta[/latex]) can have any positive value.

Predictors
The predictors (potential explanatory variables) can be of two broad types.
First, predictors can be data that were collected on the sample units - measuring Heat Load Index or soil texture, for example. This is what we will consider.
Second, predictors can be data that are available across large landscapes and approximated for each sample unit. For example, we could use the spatial coordinates of each sample unit to assign them to a pixel from a raster of variables. The advantage of this approach is that the predictors are available for a large area and the model results can therefore be used to make predictions across that entire area. I will not be illustrating this approach here; see Mokany et al. (2022) for examples.
GDM analyzes each predictor separately. A spline is fit to the predictor that expresses how different two sample units are expected to be based on how different they are in that predictor. This is repeated separately for each predictor, and the set of splines related to all of the predictors is then used to predict the ecological distance between sample units. This predicted ecological distance is then compared to the observed dissimilarity between sample units.
One issue with GDM is that the shape of this spline for a given predictor can depend on which predictors are being analyzed together.
Model Fitting
The figure below illustrates the model fitting procedure in GDM.
Each predictor is modeled separately by fitting a spline function to its values. This spline quantifies how different two sample units are in composition based on how different they are in that predictor. The slope of this spline can change across the range of values of the predictor. Where the slope is flat, there is little change in composition; two sample units that have values of the predictor in this range are predicted not to differ very much in composition. Where the slope is steeply positive, there is a large change in composition; two sample units that differ in the predictor in this range are predicted to differ much more strongly in composition.
For example, consider the sample units i and j. When these sample units are located on the spline for a predictor, the horizontal distance between them is the actual distance, or difference, between those two sample units with regard to that predictor. The vertical distance between values on the spline for these actual values of the predictor is the contribution that that predictor makes to the predicted ecological distance. In the below figure, Predictor B has a large contribution whereas Predictor C has a small contribution because of their different values of the predictors and shapes of their functions.
The contributions from all predictors are summed together. An intercept is added to account for the fact that two sample units may differ in composition (i.e., dissimilarity > 0) even when they have the same values for the predictors. The resulting value is then back-transformed via the link function into the predicted compositional dissimilarity between sample units i and j. This predicted dissimilarity is bounded {0-1} and shown as the horizontal axis of the graph on the left side of the figure; the vertical axis is the actual dissimilarity between each pair of sample units.

Oak Dataset
Let's use our oak data to explore GDM.
source("scripts/load.oak.data.R")
In our oak dataset, there is no relationship between the Bray-Curtis dissimilarity of the sample units (Oak1.dist) and the geographic distance between the sample units (geog.dis). A Mantel test (now shown) indicates that this pattern is not significant (r = 0.046, P = 0.238).

Nonetheless, let's relate the Bray-Curtis dissimilarity between sample units to a set of environmental variables:
- elevation (
Elev.m) PDIR- Heat Load Index (
HeatLoad) - depth of the A horizon (
AHoriz)
The gdm Package
The gdm package provides functions for all aspects of GDM:
- data preparation
- model fitting
- model evaluation
- model predictions
- visualization
Once you've installed this package from CRAN, load it as normal:
library(gdm)
The gdm package requires that data be formatted in a particular format. This is done with the formatsitepair() function.
Once the data are correctly formatted, we can fit the GDM model using the gdm() function. There are also functions to summarize the GDM (summary()), evaluate it (gdm.crossvalidation()), and assess the importance of each predictor (gdm.partition.deviance(), gdm.varImp()).
Model results can be used to predict the biological dissimilarities between sites (predict()) and to transform environmental predictors to biological importance (gdm.transform()).
Finally, model outputs can be plotted (plot()) and their sensitivity assessed via bootstrapping (plotUncertainty()).
We will illustrate some of these steps using our oak data.
Data Preparation (formatsitepair())
The gdm package requires that data be formatted in a particular format. This is done with the formatsitepair() function. The usage of this function is:
formatsitepair(
bioData,
bioFormat,
dist = "bray",
abundance = FALSE,
siteColumn = NULL,
XColumn,
YColumn,
sppColumn = NULL,
abundColumn = NULL,
sppFilter = 0,
predData,
distPreds = NULL,
weightType = "equal",
custWeights = NULL,
sampleSites = 1,
verbose=FALSE
)
bioData- the compositional response data, in one of four formatsbioFormat- integer identifying the format ofbioData:1- site-by-species matrix2- x, y, species list3- site-by-site biological distance (dissimilarity) matrix4- an existing site-pair table
dist- type of distance measure to calculate. Default is Bray-Curtis, though any distance measure available invegdist()can be specified. This is only used if bioData is in a format that still requires conversion to a distance matrix.abundance- whether compositional data are abundances (TRUE) or presence/absence data (FALSE). Default is the latter.siteColumn- name of column containing a unique identified for each sample unitXColumn- name of column containing the horizontal (X) coordinates of each sample unit. Must be present inbioDataand/orpredDataYColumn- name of column containing the vertical (Y) coordinates of each sample unit. Must be present inbioDataand/orpredDatapredData- environmental predictor data
Worked Example
We start by creating an object containing the predictors (predData argument). The gdm package requires a column that uniquely identifies each sample unit and the spatial coordinates of each sample unit, so we'll combine those with the four environmental variables that we identified above.
env <- Oak_explan |>
rownames_to_column(var = "Stand") |>
select(Stand, LatAppx, LongAppx, Elev.m, PDIR, HeatLoad, AHoriz)
We'll use our relativized oak data (Oak1) for the set of responses (bioData argument). We need to include the column identifying each sample unit in this object, so we'll do so and save it to a new object:
Oak1.stand <- Oak1 |>
rownames_to_column(var = "Stand")
Now we're ready to use formatsitepair() to organize the required data:
gdm.data <- formatsitepair(
bioData = Oak1.stand,
bioFormat = 1,
dist = "bray",
abundance = TRUE,
siteColumn = "Stand",
XColumn = "LatAppx",
YColumn = "LongAppx",
predData = env)
Fit GDM (gdm())
The usage of this function is:
gdm(
data,
geo = FALSE,
splines = NULL,
knots = NULL
)
The arguments are:
data- data frame created usingformatsitepair()geo- whether to include geographic distances between sample units as a model term. Default is to not do so (FALSE), even though they're required to create thedataobject.splines- optional vector of splines for fitting processknots- optional vector of knots for fitting process
Worked Example
It's easy to fit the GDM. Since we saw no evidence of a relationship between geographic distance and composition (see figure above), we'll accept the default and not require geographic distances to be a model term.
gdm.mod <- gdm(gdm.data)
We can use summary() to see a modelling summary:
summary(gdm.mod)
[1]
[1]
[1] GDM Modelling Summary
[1] Creation Date: Fri Jan 23 16:00:52 2026
[1]
[1] Name: gdm.mod
[1]
[1] Data: gdm.data
[1]
[1] Samples: 1081
[1]
[1] Geographical distance used in model fitting? FALSE
[1]
[1] NULL Deviance: 64.188
[1] GDM Deviance: 57.783
[1] Percent Deviance Explained: 9.979
[1]
[1] Intercept: 1.027
[1]
[1] PREDICTOR ORDER BY SUM OF I-SPLINE COEFFICIENTS:
[1]
[1] Predictor 1: PDIR
[1] Splines: 3
[1] Min Knot: 0.544
[1] 50% Knot: 0.853
[1] Max Knot: 0.996
[1] Coefficient[1]: 0.087
[1] Coefficient[2]: 0.15
[1] Coefficient[3]: 0.457
[1] Sum of coefficients for PDIR: 0.694
[1]
[1] Predictor 2: HeatLoad
[1] Splines: 3
[1] Min Knot: 0.636
[1] 50% Knot: 0.875
[1] Max Knot: 0.99
[1] Coefficient[1]: 0.116
[1] Coefficient[2]: 0.016
[1] Coefficient[3]: 0
[1] Sum of coefficients for HeatLoad: 0.131
[1]
[1] Predictor 3: Elev.m
[1] Splines: 3
[1] Min Knot: 76
[1] 50% Knot: 152
[1] Max Knot: 304
[1] Coefficient[1]: 0.069
[1] Coefficient[2]: 0
[1] Coefficient[3]: 0
[1] Sum of coefficients for Elev.m: 0.069
[1]
[1] Predictor 4: AHoriz
[1] Splines: 3
[1] Min Knot: 3
[1] 50% Knot: 13
[1] Max Knot: 52
[1] Coefficient[1]: 0.002
[1] Coefficient[2]: 0.041
[1] Coefficient[3]: 0
[1] Sum of coefficients for AHoriz: 0.044
The first part of the output summarizes the entire model. In this case, the total variance in the compositional dataset is 64.188 and the GDM accounts for 10% of this variation.
The predictors are reported in descending order of importance. Importance is determined by the coefficients associated with the splines for a predictor, which are reported separately and as a sum across the splines. The knots are expressed in the same units as the predictor.
In this case, PDIR, is the most important predictor. However, see below for the model evaluation.
Evaluate GDM (gdm.varImp())
The terms in a GDM can be evaluated using gdm.varImp(). The usage of this function is:
gdm.varImp(
spTable,
geo,
splines = NULL,
knots = NULL,
predSelect = FALSE,
nPerm = 50,
pValue=0.05,
parallel = FALSE,
cores = 2,
sampleSites = 1,
sampleSitePairs = 1,
outFile = NULL
)
The key arguments are:
spTable- the object fit togdm()geo- whether (TRUE) or not (FALSE) to include geographic distances between sample units as a model term.predSelect- whether to use backward elimination to drop predictors that are not significant. Default is to not do so (FALSE) and fit all predictors.nPerm- number of permutations to conduct. Default is 50.pValue- P-value to use if selecting and eliminating predictors. Default is 0.05.
Worked Example
Let's assess the importance of our predictor variables. Since we didn't test geographic distances initially, we'll continue to omit them here. Let's also increase the number of permutations to 100 and drop all predictors determined to be non-significant (P > 0.05).
gdm.varImp(
gdm.data,
geo = FALSE,
nPerm = 100
)
This function takes a bit of time to execute due to all of the permutations.
$`Model assessment`
All predictors
Model deviance 57.783
Percent deviance explained 9.979
Model p-value 0.000
Fitted permutations 100.000
$`Predictor Importance`
All predictors
Elev.m 3.064
PDIR 50.657
HeatLoad 7.751
AHoriz 3.239
$`Predictor p-values`
All predictors
Elev.m 0.62
PDIR 0.14
HeatLoad 0.50
AHoriz 0.75
$`Model Convergence`
All predictors
Elev.m 100
PDIR 100
HeatLoad 100
AHoriz 100
Predictor importance indicates how poorly the model fit if the indicated predictor was permuted. In this case, PDIR has the largest effect; models in which it is permuted explain 50.7% less deviance than those in which it is unpermuted.
The P-values indicate the proportion of permutations in which the re-ordered values of the predictor explain as much or more deviance than that explained when the predictor is unpermuted. In this case, PDIR has the lowest P-value (P = 0.14), though it is still not statistically significant.
Plot GDM Results
When applied to an object of class gdm, the plot() function will automatically create a series of graphs:
- Observed compositional dissimilarity vs. predicted ecological distance as predicted by the model
- Observed compositional dissimilarity vs. predicted compositional dissimilarity
- Splines for each predictor
The layout of these graphs can be adjusted using the plot.layout argument.
However, we can also extract the data to create and customize these graphics. We'll do so separately for the splines and the dissimilarities.
Splines
The splines related to each explanatory variable can be extracted using isplineExtract():
oak.splines <- isplineExtract(gdm.mod)
The resulting object contains a set (x) of the actual values for each explanatory variable and a set (y) of the partial ecological distance associated with each explanatory variable.
Let's organize oak.splines into a single data frame:
oak.splines.x <- oak.splines$x |>
data.frame() |>
pivot_longer(cols = everything(), names_to = "Variable_x", values_to = "x")
oak.splines.y <- oak.splines$y |>
data.frame() |>
pivot_longer(cols = everything(), names_to = "Variable_y", values_to = "y")
oak.splines.df <- cbind(oak.splines.x, oak.splines.y) |>
mutate(Variable_y = NULL) |>
rename(Variable = "Variable_x")
rm(oak.splines.x, oak.splines.y)
Though not significant, PDIR was the most important variable in our analysis. Let's look at its spline:
oak.splines.df |>
filter(Variable == "PDIR") |>
ggplot(aes(x = x, y = y)) +
geom_line() +
labs(x = "PDIR") +
theme_bw()
ggsave("graphics/GDM.spline.PDIR.png",
width = 3, height = 2.5, units = "in", dpi = 300)

This spline suggests that small differences in PDIR have a smaller effect on the distance between two stands at low than at high PDIR values. For example, if PDIR increases from 0.6 to 0.65, y changes by ~ 0.05 whereas as PDIR increases from 0.95 to 1.00, y changes by ~ 0.20.
Dissimilarities
The gdm.mod object contains numerous items, including three vectors of pairwise distances/dissimilarities. We can extract these and combine them into a data frame:
gdm.mod.df <- with(gdm.mod,
data.frame(observed = observed,
predicted = predicted,
ecological = ecological)
)
View a summary of the resulting object to explore the range of values in each column. They are:
observed- the actual compositional dissimilarity between each pair of sample unitsecological- the distance between each pair of sample units based on the environmental variables that were analyzed. Note that many values exceed 1, so these can't be dissimilarities. These are values of the link parameter, eta ([latex]\eta[/latex]).predicted- the estimated compositional dissimilarity between each pair of sample units based on the environmental variables that were analyzed. This is the value ofecologicalback-transformed into a dissimilarity, and can be calculated directly fromecological:
gdm.mod.df$predicted_manual <- 1 - exp( - gdm.mod.df$ecological)
Let's graph the predicted and observed compositional dissimilarity:
gdm.mod.df |>
ggplot(aes(x = predicted, y = observed)) +
geom_point(size = 0.5, alpha = 0.5) +
labs(x = "Predicted Dissimilarity",
y = "Actual Dissimilarity") +
theme_bw()
ggsave("graphics/GDM.dissimilarities.png",
width = 3, height = 2.5, units = "in", dpi = 300)

Oak1.dist. Each point represents the dissimilarity between a pair of stands.
There is not very good agreement between these two sets of dissimilarities. The four explanatory variables that we analyzed do not allow us to predict the compositional dissimilarities among stands. This could be a function of our choice of explanatory variables, but it also could reflect the limited ecological gradient spanned by our sample units.
Variations
Several variations on GDM have been proposed, including:
- Sparse Generalized Dissimilarity Modelling (SGDM) for multispectral data time series data (Leitão et al. 2015)
- Generative SPatial Generalized Dissimilarity Mixed Modelling (spGDMM) uses a Bayesian approach (White et al. 2024)
Conclusions
GDM is a computationally intensive extension of the Mantel test. It is appealing in that it recognizes the distributional form of the Bray-Curtis dissimilarity, but its utility with other types of dissimilarities is unclear.
The coding for GDM is more idiosyncratic than for the other analytical techniques covered here.
The intended uses for GDM are quite specific: observational studies of composition that span large environmental gradients and have associated spatial data and continuously-distributed predictor variables.
References
Ferrier, S., G. Manion, J. Elith, and K. Richardson. 2007. Using generalized dissimilarity modelling to analyze and predict patterns of beta diversity in regional biodiversity assessment. Diversity and Distributions 13:252-264.
Leitão, P.J., M. Schweider, S. Suess, I. Catry, E.J. Milton, F. Moreira, P.E. Osborne, M.J. Pinto, S. van der Linden, and P. Hostert. 2015. Mapping beta diversity from space: Sparse Generalised Dissimilarity Modelling (SGDM) for analysing high-dimensional data. Methods in Ecology and Evolution 6(7):764-771.
Mokany, K., C. Ware, S.N.C. Woolley, S. Ferrier, and M.C. Fitzpatrick. 2022. A working guide to harnessing generalized dissimilarity modelling for biodiversity analysis and conservation assessment. Global Ecology and Biogeography 31(4):802-821.
White, P.A., H.A. Frye, J.A. Slingsby, J.A. Silander Jr., and A.E. Gelfand. 2024. Generative spatial generalized dissimilarity mixed modelling (spGDMM): an enhanced approach to modelling beta diversity. Methods in Ecology and Evolution 15(1):214-226.
Media Attributions
- eta.v.d
- Mokany.et.al.2022.Figure2
- geog.v.BC
- GDM.spline.PDIR
- GDM.dissimilarities