Group Comparisons
25 Controlling Permutations
Learning Objectives
To understand how to replicate analyses by controlling the set of permutations used.
Key Packages
require(vegan, permute)
Introduction
The purpose of this chapter is to illustrate how to control permutations. This is important primarily so that all aspects of an analysis – including the P-value – are replicable. We’ve already seen multiple examples of analyses in which the test statistic is identical but the statistical significance varies. Here, we want to consider ways to control which set of permutations is used. Doing so means that the statistical significance will also be unchanged from one run to another, even if those runs are at different times or even on different machines.
These notes are illustrated with analyses using vegan::adonis2()
. However, the permutation strategies that are developed here can be applied to any of the functions from the vegan
package that use the same permutations
argument.
RRPP is coded differently, and has a built-in ability to control permutations. In RRPP::lm.rrpp()
, the set of permutations is identified by the number of permutations requested. As long as you request the same number of permutations, the statistical significance will be the same from run to run. If you change the number of permutations, a different set of permutations is used and the statistical significance may differ.
Grazing Example
We will illustrate the techniques in this section using our oak dataset.
source("scripts/load.oak.data.R")
The oak plant community dataset contains factors indicating whether stands were grazed in the past and/or grazed at the time the data were collected (GrazPast
and GrazCurr
, respectively). As we’ve done before, we’ll combine these factors together into a new Grazing
factor (capitalized):
Grazing <- factor(with(Oak, paste(GrazPast, GrazCurr, sep = "_")))
summary(Grazing)
No_No Yes_No Yes_Yes
24 6 17
How to Control Permutations
Permutations are conducted by drawing on a random number generator within R (we won’t worry about how it uses this random number generator).
It is sometimes helpful to know exactly which permutations(s) you used so that, for example, you can exactly replicate an analysis. Here are three ways to do so.
set.seed()
The sample()
function was introduced in the chapter about permutations tests. This function simply rearranges the data – it doesn’t change the data itself. For example, here is a rearrangement of the numbers from 1 to 10:
sample(10)
6 4 10 1 9 7 5 2 8 3
If we re-run it, we get a different rearrangement:
sample(10)
2 10 3 8 6 4 1 7 9 5
To exactly replicate a permutation test, we need to use the same set of permutations and therefore have to start at the same value in the random number generator. The set.seed()
function allows you to do so. If you set this, run a permutation test, re-set it, and re-run the permutation test, you will get the same permutation(s) each time and thus exactly the same results.
set.seed(42); sample(10)
10 9 3 6 4 8 5 1 2 7
set.seed(42); sample(10) #identical to previous
10 9 3 6 4 8 5 1 2 7
set.seed(40); sample(10) #different
7 8 6 1 2 3 10 9 4 5
set.seed(NULL) #reset random number generator
sample(10)
8 1 6 7 4 10 3 9 5 2
Key Takeaways
The set.seed()
function allows you to specify where to start in the random number generator and thus which set of permutations to use.
Each time you call the random number generator you ‘use’ a different set of numbers. This means that the set.seed()
function should be reset before each use. For example, setting it once and then generating two sets results in two different sets:
set.seed(42); sample(10); sample(10)
[1] 1 5 10 8 2 4 6 9 7 3
[1] 8 7 4 1 5 10 2 6 9 3
Resetting it before each set ensures that they’re the same:
set.seed(42); sample(10); set.seed(42); sample(10)
[1] 1 5 10 8 2 4 6 9 7 3
[1] 1 5 10 8 2 4 6 9 7 3
In older versions of R (pre-3.6.0), this function gave identical results regardless of platform and in different sessions. Newer versions of R use a different random number generator by default and therefore do not necessarily give identical results unless you change the type of random number generator that is being used. An explanation of this issue is here. You can change the random number generator for an individual action or for an entire session.
Changing one action:
set.seed(40, sample.kind = "Rounding"); sample(10)
7 8 6 1 2 3 10 9 4 5
This value should match among all of our machines and runs.
Changing the random number generator for the rest of an R session:
RNGkind(sample.kind = "Rounding")
This function would need to be run once per session, before using set.seed()
.
Resetting to default random number generator:
RNGkind(sample.kind = "default")
sample()
and permute::shuffle()
You can create a permutation using the sample()
function, which we’ve seen earlier, or the shuffle()
function. The shuffle()
function has the following usage:
shuffle(
n,
control = how()
)
The main arguments are:
n
– how many permuted values to return. Usually set to the number of samples in the object.control
– how the sample units are to be permuted. See chapter about restricting permutations for more information about thehow()
function.
The result is a permutation of the integers from 1:n
, just like we saw with sample()
above. For example, here is a permutation of the sample units in our oak dataset:
perm1 <- shuffle(n = length(Grazing))
[1] 44 36 15 41 35 21 28 26 32 4 47 23 25 17 6 1 46 24 5 8 40 3 42
[24] 37 29 7 31 39 13 20 12 34 9 14 2 45 38 10 43 22 11 18 30 27 16 19
[47] 33
Note how the integers from 1 to 47 have been permuted. Each integer refers to a position in the original object. For example, the first value (shown in bold) is now in the 16th position.
This permutation can be used to index other objects as occurs in a permutation test. For example, let’s use this permutation to index the species richness of the stands. Here’s the original species richness data:
Oak_explan[ , "SppRich"]
[1] 32 41 35 33 32 51 37 30 29 38 35 33 38 45 41 34 23 22 22 32 29 18 51
[24] 20 32 36 40 26 34 35 32 18 46 16 38 49 40 33 41 28 38 33 61 44 27 38
[47] 33
It can be confusing to distinguish sets of numbers. These aren’t the integers that we saw above – these values don’t get below 16 and get as high as 61. The first stand has 32 species (shown in bold).
Here’s the permuted species richness data:
Oak_explan[perm1, "SppRich"]
[1] 44 49 41 38 38 29 26 36 18 33 33 51 32 23 51 32 38 20 32 30 28 35 33
[24] 40 34 37 32 41 38 32 33 16 29 45 41 27 33 38 61 18 35 22 35 40 34 22
[47] 46
By comparing these sets of data, you can verify that the order of the species richness values has been adjusted based on the ordering in
perm1
. For example, the integer 1 is in the 16th position in perm1
(shown in bold). In the permuted species richness data, the richness from stand 1 is now in the 16th position.
Can you calculate the average richness in each grazing treatment, using both the real data and the ordering in perm1
?
Start with the real data:
Oak_explan |>
mutate(Grazing = Grazing) |>
group_by(Grazing) |>
reframe(SppRich = mean(SppRich))
# A tibble: 3 × 2
Grazing SppRich
<fct> <dbl>
1 No_No 30.4
2 Yes_No 36.2
3 Yes_Yes 39.6
These are the average number of species in each grazing treatment – on average, stands in the Yes_Yes treatment contain the most species and those in the No_No treatment contain the least species.
We could use a set of permutations to assess statistical significance, as we’ve seen throughout this section of the course. Here are the average values based on one permutation:
Oak_explan |>
mutate(Grazing = Grazing[perm1]) |>
group_by(Grazing) |>
reframe(SppRich = mean(SppRich))
# A tibble: 3 × 2
Grazing SppRich
<fct> <dbl>
1 No_No 35.1
2 Yes_No 35.8
3 Yes_Yes 33
As expected, the average richness values are all very similar to one another once the data are permuted.
We could have permuted either Grazing
or SppRich
but we would not have wanted to do both … do you see why?
permute::shuffleSet()
Since shuffle()
creates one permutation, you have to call it each time you want to generate a new permutation. The shuffleSet()
function works just like shuffle()
, except that it includes an additional argument (nset
) that allows you to specify how many permutations to create at once.
perms <- shuffleSet(n = length(Grazing), nset = 5); perms
Note the dimensionality of this object: one row per permutation and one column per sample unit.
This set of permutations could be used in a for()
loop or a matrix algebra calculation to calculate the results of all permutations. See the ‘Restricting Permutations‘ chapter for an example.
For even more control over this set of permutations, it can be proceeded by set.seed()
. This would ensure that the entire set of permutations is replicable.