Sample Datasets

Jonathan D. Bakker

Group Comparisons

Learning Objectives

To introduce two datasets for use throughout this section of the course:

A simple dataset for making calculations by hand
A larger dataset to illustrate how statistical tests can be applied in R

The larger dataset includes a script to automate its loading and initial data adjustments.

Key Packages

library(vegan)
library(labdsv)
library(tidyverse)

Throughout this section, we will work with the following two datasets. Each chapter assumes that you’ve loaded the data as described below.

Simple Example

This dataset is small enough that we can do calculations by hand. Seeing the calculations by hand helps clarify what happens when we apply the same techniques to larger datasets. We can also verify the calculations by repeating the analysis in R.

The data include two response variables (Resp1 and Resp2) on 6 plots, and a column identifying the group to which each plot belongs.

Sample Unit	Resp1	Resp2	Group
Plot1	1	4	A
Plot2	3	2	A
Plot3	5	3	A
Plot4	9	12	B
Plot5	10	8	B
Plot6	11	11	B

We can use a dataset like this to ask the question ‘do groups A and B differ in overall response?’ Note that this is a multivariate question; we are not asking about Resp1 or Resp2 individually. If the multivariate response is significant, follow-up analyses could consider the individual response variables.
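One informal way to picture ‘overall response’ is to look at the mean of both response variables within each group. The sketch below enters the six plots from the table above and calculates those means with base R. It is only an illustration of the multivariate framing, not a statistical test, and the object name centroids is my own.

```r
# Enter the six plots from the table above
perm.eg <- data.frame(
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B"),
  row.names = paste0("Plot", 1:6)
)

# Mean of each response variable within each group.
# Each row of the result is one group's position in (Resp1, Resp2) space.
centroids <- aggregate(cbind(Resp1, Resp2) ~ Group, data = perm.eg, FUN = mean)
centroids
```

Group A averages 3 on both variables, while Group B averages 10 on Resp1 and about 10.33 on Resp2. Whether a difference like this is larger than expected by chance is the question the later chapters address.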

This dataset is available as the file ‘Permutation.example.csv’ from the book’s GitHub repository. Save it in the ‘data’ sub-folder of your course folder.

Open your R project and then load these data:
perm.eg <- read.csv("data/Permutation.example.csv", header = TRUE, row.names = 1)

It can also be manually entered:
perm.eg <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)
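If you want to see what header = TRUE and row.names = 1 are doing, the following sketch writes the manually entered data to a temporary file and reads it back with the same arguments. The temporary file is only a stand-in for the downloaded file, and the object names perm.manual and perm.file are hypothetical.

```r
# Manual entry of the six plots
perm.manual <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)

# Write to a temporary CSV; write.csv() puts the row names in the first column
tmp <- tempfile(fileext = ".csv")
write.csv(perm.manual, tmp)

# Read it back: header = TRUE uses the first line as column names,
# row.names = 1 uses the first column as row names
perm.file <- read.csv(tmp, header = TRUE, row.names = 1)

# Compare the two versions
identical(rownames(perm.file), rownames(perm.manual))
identical(colnames(perm.file), colnames(perm.manual))
all.equal(perm.file$Resp1, perm.manual$Resp1)
str(perm.file)
```

Without row.names = 1, the plot names would be read as an ordinary column rather than as row names.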

Here’s the distance matrix for our simple example, calculated as Euclidean distances (the default for dist()):
Resp.dist <- perm.eg |> dplyr::select(Resp1, Resp2) |> dist()
round(Resp.dist, 3)

       Plot1  Plot2  Plot3  Plot4  Plot5
Plot2  2.828                            
Plot3  4.123  2.236                     
Plot4 11.314 11.662  9.849              
Plot5  9.849  9.220  7.071  4.123       
Plot6 12.207 12.042 10.000  2.236  3.162

I rounded the distance matrix to 3 decimal places for display purposes; in practice I would keep all decimals as calculated by R.
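To connect the matrix to a hand calculation, here is a minimal sketch that works out one entry, the distance between Plot1 and Plot2, and compares it with the value from dist(). It assumes Euclidean distance (the dist() default); the object names d12.hand and d12.R are my own.

```r
# The six plots, entered manually
perm.eg <- data.frame(
  row.names = c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5", "Plot6"),
  Resp1 = c(1, 3, 5, 9, 10, 11),
  Resp2 = c(4, 2, 3, 12, 8, 11),
  Group = c("A", "A", "A", "B", "B", "B")
)

# By hand: Plot1 is (1, 4) and Plot2 is (3, 2), so the Euclidean distance
# is the square root of the summed squared differences
d12.hand <- sqrt((1 - 3)^2 + (4 - 2)^2)   # sqrt(8)

# In R: extract the same pair from the full distance matrix
Resp.dist <- dist(perm.eg[, c("Resp1", "Resp2")])
d12.R <- as.matrix(Resp.dist)["Plot2", "Plot1"]

round(c(hand = d12.hand, R = d12.R), 3)   # both are 2.828
```

The same arithmetic applies to every other pair of plots in the matrix.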

Grazing Example (with a script!)

We’ll also look at the oak plant community dataset. Specifically, we’ll ask whether community composition differs between stands that are and are not currently grazed.

We begin by importing the data. Recall from the metadata that this dataset contains data from 47 stands. We’ll create separate objects for the composition and explanatory data. We’ll then make two adjustments to the composition data:

Remove rare species (those present on <5% of sample units)
Relativize by species maxima

Finally, we’ll use the Bray-Curtis distance measure to calculate the distance between every pair of stands.

We’ve done all of these steps before. We could continue to write these steps out each time, but I’ve prepared a script to conduct them. The script (load.oak.data.R) is available in the book’s GitHub repository.

Once you’ve saved the script to the ‘scripts’ sub-folder within your analysis folder and opened your R project file, call the script using source():
source("scripts/load.oak.data.R")

Open the script and review it to ensure that you understand what happens in it:

Three packages (vegan, labdsv, tidyverse) are loaded
Data files are imported
Response and explanatory variables are saved to separate objects
Rare species are removed from compositional data
Compositional data are relativized by species maxima
Bray-Curtis distance matrix is calculated from the relativized compositional data
Resulting matrix is saved as the object Oak1.dist
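The adjustment steps listed above can be sketched on a small made-up dataset. This is not the contents of load.oak.data.R: the abundances and the object names (toy, toy.common, toy.rel, toy.dist) are hypothetical, and the rare-species filter is written in base R as a stand-in for whatever function the script uses.

```r
library(vegan)  # decostand() and vegdist()

# Hypothetical toy community: 40 stands x 3 species (NOT the oak data)
toy <- data.frame(
  SpA    = rep(c(2, 4, 6, 8), times = 10),  # present in every stand
  SpB    = rep(c(5, 1), times = 20),        # present in every stand
  SpRare = c(3, rep(0, 39)),                # present in 1 of 40 stands (2.5%)
  row.names = paste0("Stand", 1:40)
)

# Step 1: remove rare species (those present on <5% of sample units)
freq <- colMeans(toy > 0)                   # proportion of stands occupied
toy.common <- toy[, freq >= 0.05, drop = FALSE]

# Step 2: relativize by species maxima (each column divided by its maximum)
toy.rel <- decostand(toy.common, method = "max")

# Step 3: Bray-Curtis distance between every pair of stands
toy.dist <- vegdist(toy.rel, method = "bray")

colnames(toy.common)       # SpRare has been dropped
apply(toy.rel, 2, max)     # every remaining species now peaks at 1
```

After relativizing by maxima, each species contributes on the same 0 to 1 scale, so abundant species do not dominate the distance calculation simply because of their units.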

If our research questions warranted other changes (e.g., relativizing by site totals), we could adjust the script to include them. The original objects also remain available in your R session, so you can use the script just to load them and then work with them in other ways.
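As an illustration of the kind of change mentioned above, the sketch below relativizes a made-up dataset by site totals rather than by species maxima. The data and object names (toy, toy.tot, toy.dist) are hypothetical, and this is only one way the adjustment could be written.

```r
library(vegan)  # decostand() and vegdist()

# Hypothetical toy abundances: 3 stands x 3 species (NOT the oak data)
toy <- data.frame(
  SpA = c(2, 0, 6),
  SpB = c(2, 5, 3),
  SpC = c(4, 5, 1),
  row.names = c("Stand1", "Stand2", "Stand3")
)

# Relativize by site (row) totals; method = "total" works on rows by default
toy.tot <- decostand(toy, method = "total")

# Each stand's abundances now sum to 1
rowSums(toy.tot)

# Distances are then calculated from the relativized data as before
toy.dist <- vegdist(toy.tot, method = "bray")
```

Relativizing by site totals expresses each species as a proportion of its stand’s total abundance, which addresses a different question than relativizing by species maxima.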

Our grouping factor for this example is current grazing status (Yes, No). We could index this factor each time we need it (Oak$GrazCurr), but for clarity we’ll create an object consisting of just the grazing status of each stand:
grazing <- Oak$GrazCurr

This is not part of the script because the focal explanatory variables will often vary from study to study.
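Before using a grouping object like this, it can help to confirm that it has one value per stand and to see how many stands fall in each group. The sketch below does this with made-up stand-ins (Oak.toy, toy.dist, grazing.toy) so that it runs on its own; none of the values come from the oak dataset.

```r
# Hypothetical stand-in for the explanatory data (NOT the real oak values)
Oak.toy <- data.frame(
  GrazCurr = c("Yes", "No", "No", "Yes", "No"),
  row.names = paste0("Stand", 1:5)
)

# Hypothetical stand-in for a distance matrix on the same five stands
toy.dist <- dist(matrix(1:10, nrow = 5,
                        dimnames = list(rownames(Oak.toy), NULL)))

# Pull out the grouping factor as its own object
grazing.toy <- Oak.toy$GrazCurr

# How many stands are in each group?
table(grazing.toy)

# One grouping value per stand in the distance matrix?
length(grazing.toy) == attr(toy.dist, "Size")
```

With the real data, the same two checks could be run on grazing and Oak1.dist once the script has been sourced.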

License

Applied Multivariate Statistics in R Copyright © 2024 by Jonathan D. Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.
