Group Comparisons

25 Controlling Permutations

Learning Objectives

To understand how to replicate analyses by controlling the set of permutations used.

Key Packages

require(vegan, permute)

 

Introduction

The purpose of this chapter is to illustrate how to control permutations.  This is important primarily so that all aspects of an analysis – including the P-value – are replicable.  We’ve already seen multiple examples of analyses in which the test statistic is identical but the statistical significance varies.  Here, we want to consider ways to control which set of permutations is used.  Doing so means that the statistical significance will also be unchanged from one run to another, even if those runs are at different times or even on different machines.

These notes are illustrated with analyses using vegan::adonis2().  However, the permutation strategies that are developed here can be applied to any of the functions from the vegan package that use the same permutations argument.

RRPP is coded differently, and has a built-in ability to control permutations.  In RRPP::lm.rrpp(), the set of permutations is identified by the number of permutations requested.  As long as you request the same number of permutations, the statistical significance will be the same from run to run.  If you change the number of permutations, a different set of permutations is used and the statistical significance may differ.

Grazing Example

We will illustrate the techniques in this section using our oak dataset.

source("scripts/load.oak.data.R")

 

The oak plant community dataset contains factors indicating whether stands were grazed in the past and/or grazed at the time the data were collected (GrazPast and GrazCurr, respectively).  As we’ve done before, we’ll combine these factors together into a new Grazing factor (capitalized):

Grazing <- factor(with(Oak, paste(GrazPast, GrazCurr, sep = "_")))

summary(Grazing)

  No_No  Yes_No Yes_Yes 
     24       6      17

How to Control Permutations

Permutations are conducted by drawing on a random number generator within R (we won’t worry about how it uses this random number generator).

It is sometimes helpful to know exactly which permutations(s) you used so that, for example, you can exactly replicate an analysis.  Here are three ways to do so.

set.seed()

The sample() function was introduced in the chapter about permutations tests.  This function simply rearranges the data – it doesn’t change the data itself.  For example, here is a rearrangement of the numbers from 1 to 10:

sample(10)

6  4 10  1  9  7  5  2  8  3

If we re-run it, we get a different rearrangement:

sample(10)

2 10  3  8  6  4  1  7  9  5

 

To exactly replicate a permutation test, we need to use the same set of permutations and therefore have to start at the same value in the random number generator.  The set.seed() function allows you to do so.  If you set this, run a permutation test, re-set it, and re-run the permutation test, you will get the same permutation(s) each time and thus exactly the same results.

set.seed(42); sample(10)

10  9  3  6  4  8  5  1  2  7

set.seed(42); sample(10) #identical to previous

10  9  3  6  4  8  5  1  2  7

set.seed(40); sample(10) #different

7  8  6  1  2  3 10  9  4  5

set.seed(NULL) #reset random number generator
sample(10)

8  1  6  7  4 10  3  9  5  2

Key Takeaways

The set.seed() function allows you to specify where to start in the random number generator and thus which set of permutations to use.

 

Each time you call the random number generator you ‘use’ a different set of numbers.  This means that the set.seed() function should be reset before each use.  For example, setting it once and then generating two sets results in two different sets:

set.seed(42); sample(10); sample(10)

 [1]  1  5 10  8  2  4  6  9  7  3
 [1]  8  7  4  1  5 10  2  6  9  3

Resetting it before each set ensures that they’re the same:

set.seed(42); sample(10); set.seed(42); sample(10)

 [1]  1  5 10  8  2  4  6  9  7  3
 [1]  1  5 10  8  2  4  6  9  7  3

 

In older versions of R (pre-3.6.0), this function gave identical results regardless of platform and in different sessions.  Newer versions of R use a different random number generator by default and therefore do not necessarily give identical results unless you change the type of random number generator that is being used.  An explanation of this issue is hereYou can change the random number generator for an individual action or for an entire session.

Changing one action:

set.seed(40, sample.kind = "Rounding"); sample(10)

7  8  6  1  2  3 10  9  4  5

This value should match among all of our machines and runs.

 

Changing the random number generator for the rest of an R session:

RNGkind(sample.kind = "Rounding")

This function would need to be run once per session, before using set.seed().

 

Resetting to default random number generator:

RNGkind(sample.kind = "default")

sample() and permute::shuffle()

You can create a permutation using the sample() function, which we’ve seen earlier, or the shuffle() function.  The shuffle() function has the following usage:

shuffle(
n,
control = how()
)

The main arguments are:

  • n – how many permuted values to return.  Usually set to the number of samples in the object.
  • control – how the sample units are to be permuted.  See chapter about restricting permutations for more information about the how() function.

The result is a permutation of the integers from 1:n, just like we saw with sample() above.  For example, here is a permutation of the sample units in our oak dataset:

perm1 <- shuffle(n = length(Grazing))

 [1] 44 36 15 41 35 21 28 26 32  4 47 23 25 17  6  1 46 24  5  8 40  3 42
[24] 37 29  7 31 39 13 20 12 34  9 14  2 45 38 10 43 22 11 18 30 27 16 19
[47] 33

Note how the integers from 1 to 47 have been permuted.  Each integer refers to a position in the original object.  For example, the first value (shown in bold) is now in the 16th position.

 

This permutation can be used to index other objects as occurs in a permutation test.  For example, let’s use this permutation to index the species richness of the stands.  Here’s the original species richness data:

Oak_explan[ , "SppRich"]

 [1] 32 41 35 33 32 51 37 30 29 38 35 33 38 45 41 34 23 22 22 32 29 18 51
[24] 20 32 36 40 26 34 35 32 18 46 16 38 49 40 33 41 28 38 33 61 44 27 38
[47] 33

It can be confusing to distinguish sets of numbers.  These aren’t the integers that we saw above – these values don’t get below 16 and get as high as 61.  The first stand has 32 species (shown in bold).

Here’s the permuted species richness data:

Oak_explan[perm1, "SppRich"]

 [1] 44 49 41 38 38 29 26 36 18 33 33 51 32 23 51 32 38 20 32 30 28 35 33
[24] 40 34 37 32 41 38 32 33 16 29 45 41 27 33 38 61 18 35 22 35 40 34 22
[47] 46

By comparing these sets of data, you can verify that the order of the species richness values has been adjusted based on the ordering in perm1.  For example, the integer 1 is in the 16th position in perm1 (shown in bold).   In the permuted species richness data, the richness from stand 1 is now in the 16th position.

 

Can you calculate the average richness in each grazing treatment, using both the real data and the ordering in perm1?

Start with the real data:

Oak_explan |>
mutate(Grazing = Grazing) |>
  group_by(Grazing) |>
reframe(SppRich = mean(SppRich))

# A tibble: 3 × 2
  Grazing SppRich
  <fct>     <dbl>
1 No_No      30.4
2 Yes_No     36.2
3 Yes_Yes    39.6

These are the average number of species in each grazing treatment – on average, stands in the Yes_Yes treatment contain the most species and those in the No_No treatment contain the least species.

We could use a set of permutations to assess statistical significance, as we’ve seen throughout this section of the course.  Here are the average values based on one permutation:

Oak_explan |>
mutate(Grazing = Grazing[perm1]) |>
group_by(Grazing) |>
reframe(SppRich = mean(SppRich))

# A tibble: 3 × 2
  Grazing SppRich
  <fct>     <dbl>
1 No_No      35.1
2 Yes_No     35.8
3 Yes_Yes    33

As expected, the average richness values are all very similar to one another once the data are permuted.

We could have permuted either Grazing or SppRich but we would not have wanted to do both … do you see why?

permute::shuffleSet()

Since shuffle() creates one permutation, you have to call it each time you want to generate a new permutation.  The shuffleSet() function works just like shuffle(), except that it includes an additional argument (nset) that allows you to specify how many permutations to create at once.

perms <- shuffleSet(n = length(Grazing), nset = 5); perms

Note the dimensionality of this object: one row per permutation and one column per sample unit.

This set of permutations could be used in a for() loop or a matrix algebra calculation to calculate the results of all permutations.  See the ‘Restricting Permutations‘ chapter for an example.

For even more control over this set of permutations, it can be proceeded by set.seed().  This would ensure that the entire set of permutations is replicable.

License

Icon for the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License

Applied Multivariate Statistics in R Copyright © 2024 by Jonathan D. Bakker is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

Share This Book