Group Comparisons
15 Sample Datasets
Learning Objectives
To introduce two datasets for use throughout this section of the course:
- A simple dataset for making calculations by hand
- A larger dataset to illustrate how statistical tests can be applied in R
The larger dataset includes a script to automate its loading and initial data adjustments.
Key Packages
library(vegan); library(labdsv); library(tidyverse)
Throughout this section, we will work with the following two datasets. Each chapter assumes that you’ve loaded the data as described below.
Simple Example
This dataset is small enough that we can do calculations by hand. Seeing the calculations by hand helps clarify what happens when we apply the same techniques to larger datasets. We can also verify the calculations by repeating the analysis in R.
The data include two response variables (Resp1 and Resp2) on 6 plots, and a column identifying the group to which each plot belongs.
Sample Unit | Resp1 | Resp2 | Group |
Plot1 | 1 | 4 | A |
Plot2 | 3 | 2 | A |
Plot3 | 5 | 3 | A |
Plot4 | 9 | 12 | B |
Plot5 | 10 | 8 | B |
Plot6 | 11 | 11 | B |
We can use a dataset like this to ask the question ‘do groups A and B differ in overall response?’ Note that this is a multivariate question; we are not asking about Resp1 or Resp2 individually. Follow-up analyses could consider the individual response variables if the multivariate response is significant.
This dataset is available as the file ‘Permutation.example.csv’ from the book’s GitHub repository. Save it in the ‘data’ sub-folder of your course folder.
Open your R project and then load these data:
perm.eg <- read.csv("data/Permutation.example.csv", header = TRUE, row.names = 1)
It can also be manually entered:
perm.eg <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)
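Before calculating distances, a quick look at the data can confirm that they were entered correctly and give a feel for the question posed above. The sketch below uses only base R; the object name group.means is chosen here for illustration and is not used elsewhere in the course.

```r
# Same values as the manual entry above
perm.eg <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)

# Check dimensions and column types
str(perm.eg)

# Mean of each response variable within each group
# (group.means is an illustrative name, not part of the course scripts)
group.means <- aggregate(cbind(Resp1, Resp2) ~ Group, data = perm.eg, FUN = mean)
group.means
```

Group B has higher means than group A on both responses. The multivariate question is whether the groups differ when both responses are considered together.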
Here’s the Euclidean distance matrix (the default for dist()) for our simple example:
Resp.dist <- perm.eg |>
  dplyr::select(Resp1, Resp2) |>
  dist()
round(Resp.dist, 3)
       Plot1  Plot2  Plot3  Plot4  Plot5
Plot2  2.828
Plot3  4.123  2.236
Plot4 11.314 11.662  9.849
Plot5  9.849  9.220  7.071  4.123
Plot6 12.207 12.042 10.000  2.236  3.162
I rounded the distance matrix to 3 decimal places for display purposes; in practice I would keep all decimals as calculated by R.
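Because the dataset is so small, any entry in this matrix can be checked by hand. As a minimal sketch, assuming the default Euclidean distance, the code below calculates the distance between Plot1 and Plot2 directly and compares it with the value from dist(). The object names resp, d12.hand, and d12.dist are illustrative only.

```r
# Same values as the simple example
perm.eg <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)

# Keep only the response variables, as a numeric matrix
resp <- as.matrix(perm.eg[, c("Resp1", "Resp2")])

# Euclidean distance between Plot1 and Plot2 by hand:
# sqrt((1 - 3)^2 + (4 - 2)^2) = sqrt(8)
d12.hand <- sqrt(sum((resp["Plot1", ] - resp["Plot2", ])^2))

# The same entry extracted from the full distance matrix
d12.dist <- as.matrix(dist(resp))["Plot2", "Plot1"]

round(d12.hand, 3)
all.equal(d12.hand, d12.dist)
```

The rounded value should match the Plot1-Plot2 entry (2.828) in the matrix above.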
Grazing Example (with a script!)
We’ll also look at the oak plant community dataset. Specifically, we’ll ask whether community composition is correlated with differences in current grazing status.
We begin by importing the data. Recall from the metadata that this dataset contains data from 47 stands. We’ll create separate objects for the composition and explanatory data. We’ll then make two adjustments to the composition data:
- Remove rare species (those present on <5% of sample units)
- Relativize by species maxima
Finally, we’ll use the Bray-Curtis distance measure to calculate the distance between every pair of stands.
We’ve done all of these steps before. We could continue to write these steps out each time, but I’ve prepared a script to conduct them. The script (load.oak.data.R) is available in the book’s GitHub repository.
Once you’ve saved the script to the ‘scripts’ sub-folder within your analysis folder and opened your R project file, call the script using source():
source("scripts/load.oak.data.R")
Open the script and review it to ensure that you understand what happens in it:
- Three packages (vegan, labdsv, tidyverse) are loaded
- Data files are imported
- Response and explanatory variables are saved to separate objects
- Rare species are removed from compositional data
- Compositional data are relativized by species maxima
- Bray-Curtis distance matrix is calculated from the relativized compositional data
- Resulting matrix is saved as the object Oak1.dist
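To make the three data adjustments listed above concrete, here is a minimal base R sketch applied to a small made-up community matrix. It is not the contents of load.oak.data.R: the toy data, the object names (comm, comm.common, comm.rel, toy.dist), and the helper bray.curtis() are all hypothetical. In practice these steps are usually done with package functions such as vegan's decostand(x, method = "max") and vegdist(x, method = "bray"), and the script may well do so.

```r
# Hypothetical toy data: 40 stands x 3 species (NOT the oak dataset)
n.stands <- 40
comm <- cbind(
  SpA = rep(c(10, 20), times = n.stands / 2),  # present in every stand
  SpB = rep(c(0, 4), each = n.stands / 2),     # present in half of the stands
  SpC = c(5, rep(0, n.stands - 1))             # present in a single stand
)
rownames(comm) <- paste0("Stand", seq_len(n.stands))

# 1. Remove rare species: keep those present in at least 5% of stands
freq <- colSums(comm > 0) / nrow(comm)
comm.common <- comm[, freq >= 0.05, drop = FALSE]

# 2. Relativize by species maxima: divide each column by its maximum
comm.rel <- sweep(comm.common, 2, apply(comm.common, 2, max), "/")

# 3. Bray-Curtis dissimilarity between every pair of stands:
#    sum of absolute differences / sum of all abundances in the pair
#    (illustrative helper; assumes no stand is empty)
bray.curtis <- function(x) {
  n <- nrow(x)
  d <- matrix(0, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(x[i, ] - x[j, ])) / sum(x[i, ] + x[j, ])
    }
  }
  as.dist(d)
}
toy.dist <- bray.curtis(comm.rel)

colnames(comm.rel)   # SpC was dropped as rare
```

In this toy example SpC occurs in 1 of 40 stands (2.5%), so it is removed, and every remaining species has a maximum of 1 after relativization.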
If our research questions warranted other changes (e.g., relativizing by site totals), we could adjust the script to include them. The original objects also remain available in your R session if you just want to use the script to load them and then work with them in other ways.
Our grouping factor for this example is current grazing status (Yes, No). We could index this factor each time we need it (Oak$GrazCurr) but for clarity we’ll create an object consisting of just the grazing status of each plot:
grazing <- Oak$GrazCurr
This is not part of the script because the focal explanatory variables will often vary from study to study.