Foundational Concepts
6 Transformations
Learning Objectives
To consider how transformations relate to the research questions being addressed.
To illustrate how to transform data in R.
A transformation is applied identically to all data elements. This means that the result of an action is the same whether you consider an element alone or as part of a set. For example, calculating the square root of a value is unaffected by whether that value is part of a set. This is contrasted with relativizations, where the result of the action depends on other elements in the set.
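The distinction can be checked directly in R. The sketch below uses hypothetical values (the vectors x and y are not from the course data) and uses division by the maximum as one example of a relativization:

```r
# Hypothetical abundances for one sample unit
x <- c(4, 9, 16)

# Transformation: the result for an element does not depend on the others
sqrt_in_set <- sqrt(x)[2]   # square root of 9, computed as part of the set
sqrt_alone  <- sqrt(9)      # square root of 9, computed alone
sqrt_in_set == sqrt_alone

# Relativization (here, dividing by the maximum): the result for the same
# element changes when the other elements in the set change
y <- c(4, 9, 36)            # same second element, different maximum
rel_x <- (x / max(x))[2]
rel_y <- (y / max(y))[2]
rel_x == rel_y
```

The first comparison returns TRUE and the second returns FALSE, which is the behavior that separates the two kinds of adjustment.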
Transformations
Transformations are applied identically to all elements within an object.
Transformations
There are many potential transformations that can be applied to data; we will review the most common ones here. McCune & Grace (2002, p. 67) note that transformations can be conducted for statistical or ecological reasons. However, many of the techniques we will consider do not require normality and other assumptions of parametric techniques. Thus, we can focus our transformations on the ecological questions that we seek to answer.
If you apply a transformation to univariate data, such as an explanatory variable, the results should generally be back-transformed to the original units for presentation. This is true for any analysis conducted on transformed data.
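Back-transformation simply applies the inverse function. Here is a minimal sketch, assuming hypothetical biomass values and a log10(x + 1) transformation; the object names are for illustration only:

```r
# Hypothetical biomass values in their original units
biomass <- c(0, 9, 99, 999)

# Transform for analysis: log10 after adding 1 to every value
log_biomass <- log10(biomass + 1)

# Back-transform for presentation: reverse each step in the opposite order
back <- 10^log_biomass - 1
all.equal(back, biomass)

# If log1p() was used instead, expm1() is its inverse
back_natural <- expm1(log1p(biomass))
all.equal(back_natural, biomass)
```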
Roots (square root, cube root, etc.)
Root transformations can be applied to count data, which are often modeled with a Poisson distribution.
Different orders of roots can be used, depending on the range of values in the data. For example, marine benthic studies can include organisms from phyla that span several orders of magnitude in abundance – there might be one starfish but tens of thousands of smaller invertebrates. Fourth-root transformations are often applied to this type of data so that the numerically dominant smaller taxa do not overwhelm comparisons among sample units.
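To see how the order of the root changes the spread of the data, consider this sketch. The taxa and counts are hypothetical and were chosen only to span several orders of magnitude:

```r
# Hypothetical counts from one benthic sample
counts <- c(starfish = 1, polychaetes = 100, amphipods = 10000)

# Square root and fourth root of the same counts
square_root <- sqrt(counts)
fourth_root <- counts^(1/4)

# Ratio of the largest to the smallest value under each option
ratio_raw    <- max(counts) / min(counts)
ratio_sqrt   <- max(square_root) / min(square_root)
ratio_fourth <- max(fourth_root) / min(fourth_root)
c(raw = ratio_raw, sqrt = ratio_sqrt, fourth = ratio_fourth)
```

The higher the order of the root, the more the range is compressed, so the most abundant taxon contributes less to differences among sample units.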
Logarithms
Biomass or ratio data are often log-transformed. This commonly involves base-10 or natural logarithms (make sure to note which you use!).
A logarithm can’t be calculated on zero. If your data include zeroes, you may need to add a small value to allow this calculation to proceed. Note that you would add this value to all data points being transformed, not just the zeroes. This can be done manually, or you can use an existing function such as log1p().
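The following sketch shows the problem with zeroes and both ways of handling it. The vector x is hypothetical:

```r
# Hypothetical data that include a zero
x <- c(0, 1, 9, 99)

# The logarithm of zero is -Inf, which most analyses cannot use
raw_log <- log(x)
raw_log

# Manual approach: add a constant to every value, then take the logarithm
manual_log10 <- log10(x + 1)

# Existing function: log1p(x) returns the natural logarithm of (1 + x)
auto_log <- log1p(x)
all.equal(auto_log, log(x + 1))
```

Note that log1p() uses the natural logarithm, so it is not interchangeable with log10(x + 1).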
Arcsin-square root
The arcsin-square root transformation can be used with proportional data. For example, vegetation abundance is often expressed as percent cover, which is easily converted to proportions. This transformation doesn’t work for negative values or values > 1.
Some authors strongly discourage using this transformation for univariate analyses (Warton & Hui 2011), but I have not seen this recommendation carried over to multivariate contexts.
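As an illustration, here is a sketch that converts hypothetical percent cover values to proportions and then applies the arcsin-square root transformation. The range check is my addition, not part of the transformation itself:

```r
# Hypothetical percent cover values
cover_pct <- c(0, 5, 25, 50, 100)

# Convert percentages to proportions
cover_prop <- cover_pct / 100

# Confirm that all values are within [0, 1] before transforming
in_range <- all(cover_prop >= 0 & cover_prop <= 1)
in_range

# Arcsin-square root transformation; results are in radians, from 0 to pi/2
asin_sqrt <- asin(sqrt(cover_prop))
range(asin_sqrt)
```

Applying asin(sqrt()) to a value greater than 1 or less than 0 returns NaN with a warning, which is why the range check is worth doing first.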
Binary
A binary transformation converts continuous data to 0 or 1 based on whether a criterion is met. This is most often used to convert abundance data to presence/absence. Another example of this as a transformation would be to evaluate whether abundance data exceed a static value such as ‘5 individuals’ or ‘5% cover’.
Depending on the criterion, a binary adjustment can also be a type of relativization (see the ‘Relativizations’ chapter).
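The sketch below applies both kinds of criteria to a hypothetical abundance vector. The last criterion, the mean of the set, is my own example of one that depends on the other elements:

```r
# Hypothetical abundances of one species across six sample units
abund <- c(0, 2, 5, 12, 0, 7)

# Presence/absence: 1 if the species was recorded, 0 otherwise
pres_abs <- ifelse(abund > 0, 1, 0)

# Static threshold: 1 if abundance exceeds 5 individuals
over_5 <- ifelse(abund > 5, 1, 0)

# Criterion calculated from the data themselves (here, the mean): the
# result for an element now depends on the rest of the set, so this
# behaves as a relativization rather than a transformation
over_mean <- ifelse(abund > mean(abund), 1, 0)

rbind(abund, pres_abs, over_5, over_mean)
```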
Applications in R
In R, transformations are easily performed by applying a function to a matrix; the function is automatically applied to every element in the matrix.
The transformed data are generally assigned to a new object so that the original data remain intact.
Here are the above transformations applied to the object x:
| R Function | Note |
|---|---|
| sqrt(x) or x^(1/2) | Square root of x |
| x^(1/4) | Fourth root of x |
| log10(x) | Logarithm (base 10) of x |
| log(x) | Natural logarithm of x |
| log1p(x) | Natural logarithm of x + 1 (can be applied to zero values) |
| asin(sqrt(x)) | Arcsin square root of x |
| ifelse(x > 0, 1, 0) | Convert x to presence/absence |
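To see the element-wise behavior, here is a sketch that applies two of these functions to a small hypothetical matrix. The plot and species names are invented for illustration:

```r
# Hypothetical abundance matrix: 2 plots (rows) by 3 species (columns)
x <- matrix(c(0, 1, 4, 9, 16, 25), nrow = 2,
            dimnames = list(c("Plot1", "Plot2"), c("SpA", "SpB", "SpC")))

# Each function is applied to every element; results are assigned to
# new objects so that x itself is unchanged
x_sqrt <- sqrt(x)
x_pa   <- ifelse(x > 0, 1, 0)

# The transformed objects keep the dimensions and names of the original
dim(x_sqrt)
x_sqrt
x_pa
```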
Oak Plant Communities Example
Let’s illustrate these transformations using our oak plant communities dataset. Begin by opening the R project and loading the data:
Oak <- read.csv("data/Oak_data_47x216.csv", header = TRUE, row.names = 1)
Oak_species <- read.csv("data/Oak_species_189x5.csv", header = TRUE)
Create separate objects for the response and explanatory data:
Oak_abund <- Oak[ , colnames(Oak) %in% Oak_species$SpeciesCode]
Oak_explan <- Oak[ , ! colnames(Oak) %in% Oak_species$SpeciesCode]
See the ‘Loading Data’ chapter if you do not understand what these actions accomplished.
Transforming the Response Variables
The response variables are usually treated as a set. To illustrate, let’s apply a square root transformation to our response variables:
Sqrt_Oak_abund <- sqrt(Oak_abund)
Compare the two objects to verify that the data changed as intended. For example, consider the abundance of Abgr.s in stand 28:
sqrt(Oak_abund["Stand28", "Abgr.s"])
Sqrt_Oak_abund["Stand28", "Abgr.s"]
Confirm that the value is the same when the transformation is applied to just this value or to the entire object. This is one way to confirm that this was a transformation rather than a relativization.
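The same check can be extended from one cell to every cell. The sketch below uses a small hypothetical data frame in place of Oak_abund so that it runs on its own; the stand names and the Quga.t column are invented:

```r
# Hypothetical stand-in for the abundance data
toy_abund <- data.frame(Abgr.s = c(0, 4, 9),
                        Quga.t = c(16, 25, 1),
                        row.names = c("Stand01", "Stand02", "Stand03"))

# Transform the entire object
sqrt_toy <- sqrt(toy_abund)

# Single-cell check, as in the text
one_cell <- sqrt(toy_abund["Stand02", "Abgr.s"]) == sqrt_toy["Stand02", "Abgr.s"]

# All cells at once: transform each column separately and compare
by_column <- sapply(toy_abund, sqrt)
all_cells <- all(by_column == as.matrix(sqrt_toy))

c(one_cell = one_cell, all_cells = all_cells)
```

If both values are TRUE, the result for each element did not depend on the other elements it was processed with.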
Although we used the response variables to illustrate a transformation, in reality it would not make sense to apply this transformation to the entire matrix because the responses are a mix of cover values and of basal areas – see the metadata for details.
Transforming Explanatory Variables
A study often includes many types of explanatory variables – continuously distributed predictors, experimental factors, etc. It therefore would rarely make sense to apply the same transformation to a matrix of explanatory variables.
However, it can make sense to transform individual variables. Each explanatory variable should be evaluated separately to determine which type of transformation, if any, is appropriate.
Let’s transform the number of large oak trees. These data are a count, so we’ll use a log transformation:
Oak_explan$log_Quga <- log10(Oak_explan$Quga.gt60cm + 1)
I added one to all values to account for the possibility that a stand may not have had any large oak trees.
In fact, this variable is already present in the data frame as the variable ‘LogQuga.gt60cm’. Compare these two variables to verify that our calculation was done correctly:
library(tidyverse)
Oak_explan |>
  mutate(diff = round(log_Quga, 2) - LogQuga.gt60cm) |>
  filter(diff != 0)
The existing variable is reported to two decimal places so we had to do the same to our new variable to avoid rounding differences. Verify the outcome of the above comparison by changing from a test for inequality (!=) to a test for equality (==).
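Exact tests for equality on decimal values can be sensitive to how numbers are stored, so a tolerance-based comparison is a useful alternative. The sketch below uses base R and a hypothetical data frame in place of Oak_explan. The counts, the stored two-decimal values, and the tolerance of 1e-8 are all assumptions for illustration:

```r
# Hypothetical counts and a pre-calculated variable stored to two decimals
toy <- data.frame(Quga.gt60cm = c(0, 1, 4, 12),
                  LogQuga.gt60cm = c(0.00, 0.30, 0.70, 1.11))

# Recalculate the transformation
toy$log_Quga <- log10(toy$Quga.gt60cm + 1)

# Difference between the rounded new variable and the existing one
toy$diff <- round(toy$log_Quga, 2) - toy$LogQuga.gt60cm

# Keep only rows where the difference exceeds a small tolerance
mismatch <- toy[abs(toy$diff) > 1e-8, ]
nrow(mismatch)
```

Zero rows in mismatch indicates that the two variables agree to two decimal places.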
Concluding Thoughts
Decisions about whether and how to transform the data can strongly affect the conclusions of subsequent analyses. Most of the techniques that we are using in this course make minimal statistical assumptions, which means that adjustments do not have to be made for statistical reasons but rather can focus on the ecological questions of interest.
Transformations can be applied to both response variables and explanatory variables. Response variables are often transformed en masse. Explanatory variables are usually transformed individually. Each explanatory variable can be evaluated separately to determine which type of transformation, if any, is appropriate.
Transformations should be scripted rather than permanently changing the raw data file. Scripting ensures flexibility to try other adjustments, skip them entirely, etc.
The transformations that have been discussed here are for continuously distributed variables. For categorical explanatory variables, other actions may be required such as combining similar categories together or restricting analyses to focus on a subset of the categories. These decisions should be based on the objectives of the analysis and the ecological questions that you seek to answer.
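Both of these actions can be scripted. Here is a minimal sketch using a hypothetical soil texture factor. The categories, and the decision to merge two of them, are invented for illustration:

```r
# Hypothetical categorical explanatory variable
soil <- factor(c("clay", "clay loam", "sand", "loam", "sand", "clay"))

# Combine similar categories: assigning the same name to two levels merges them
soil_combined <- soil
merge_these <- levels(soil_combined) %in% c("clay", "clay loam")
levels(soil_combined)[merge_these] <- "clayey"
table(soil_combined)

# Restrict to a subset of categories and drop the unused levels
keep <- soil %in% c("sand", "loam")
soil_subset <- droplevels(soil[keep])
table(soil_subset)
```

In a real analysis, the same row selection (keep) would also need to be applied to the response data so that the two objects stay aligned.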
References
McCune, B., and J.B. Grace. 2002. Analysis of Ecological Communities. MjM Software Design, Gleneden Beach, OR.
Warton, D.I., and F.K.C. Hui. 2011. The arcsine is asinine: the analysis of proportions in ecology. Ecology 92:3-10.