Chapter 35 ANOVA Posthoc Comparisons

library(tidyverse)
library(DescTools)
library(ez)
library(lsr)

Where example data is shown below, it is based upon the ANOVA analysis of the chickwts data set shown in Chapter 30.

35.1 Overview of options

The major issue to address in ANOVA posthoc testing is the multiple comparison problem. We have a family of comparisons. Every comparison is a hypothesis test. Every hypothesis test carries type1 and type2 error risk. Testing many hypotheses as a family from one data set makes type1 errors all but certain if no adjustment is made for the multiple comparisons.

There are two fundamental ways to control the family-wise error rate (FWER) after ANOVA: 1) p-value adjustments, or 2) range tests.

The menagerie of options within each of these is large and confusing.

That largely stems from the fact that nobody really agrees on the best way to control FWER. When a fist fight breaks out at an international statistics conference, you can be almost certain it is over FWER. In fact, it would seem everyone who has ever had a strong opinion on the matter has contributed an eponymous test.

What is less obvious, but most important, is that you have a judgment to make about the best approach for the problem you are trying to solve with your own experiments.

Here’s what is most important to keep in mind. Sometimes we do these multi-factorial experiments in an exploratory mindset, without precise planning for the groups we might want to compare. We may even compare every group to every other group, and sometimes that is all right. Know that some post hoc tests are better suited than others for that “hit seeking” mindset.

Other times we’ve set up a severe test of a crucial scientific prediction. We have a clear idea of the comparisons we want to make. We hope the result will be the foundation of the next few years of work. We (should) have very low tolerance for type1 error in a situation like that. Know there are post hoc tests more suited than others for protecting against type1 error in cases like that.

That is it in a nutshell. The adjustment method you choose should be based upon your error tolerance, given the experiment and its goals.

35.1.1 Pairwise.t.test with p-value adjustments

I promote ANOVA posthoc testing using the pairwise.t.test. There are a few reasons for this.

  • We are already familiar with t-tests.
  • We only need one function for many kinds of uses.
  • It can be used for both completely randomized (unpaired) and related measures (paired) designs.
  • Several p-value adjustment options are available.

The pairwise.t.test function is not the t.test function. The pairwise.t.test denominator uses a standard error derived from the residual variance of the full experiment, rather than from only the two groups in a given comparison. This is statistically justifiable because it takes into account the full context in which the selected two groups are compared.
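To make that distinction concrete, here is a minimal sketch comparing the two denominators for the casein vs linseed pair (the helper object two is my own, used only for this illustration):

# ordinary two-sample t test: variance estimated from these two groups only
two <- droplevels(subset(chickwts, feed %in% c("casein", "linseed")))
t.test(weight ~ feed, data = two, var.equal = TRUE)$p.value

# pairwise.t.test: standard error built from the residual variance of all six groups
pairwise.t.test(chickwts$weight, chickwts$feed,
                p.adjust.method = "none")$p.value["linseed", "casein"]

The two p-values differ because the pooled version uses a different error variance and carries more degrees of freedom.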

A downside of using the pairwise.t.test is that the function only produces adjusted p-values. It does not produce adjusted confidence intervals.

Indeed, except for the straight Bonferroni, adjusted confidence intervals are not readily derived from these step procedures.

If confidence intervals are sought for inferential purposes, a range test, such as the TukeyHSD, is a better post hoc option. For that, use DescTools::PostHocTest(method = "hsd"), illustrated below.

35.1.1.1 P-value adjustment methods

When using the pairwise.t.test we need to select a p-value adjustment method. Here are the options. The output of the test will be an array of adjusted p-values, corresponding to each of the comparisons made.

We reject the null for any comparison whose adjusted p-value is lower than the pre-determined type1 FWER threshold (e.g., 5%).

The table below compares the performance of these adjustment algorithms on a common vector of unadjusted p-values (see the none column). The pattern from left to right is clear. Both the Bonferroni and Holm offer the most severe type1 protection, the BY is balanced between type1 and type2, the Hommel and Hochberg are liberal and perform about the same, and the BH (aka FDR) is the most liberal.

Table: Comparison of p-value adjustment methods available in the pairwise.t.test function; most liberal on the left to most conservative on the right.

none       BH_FDR     hochberg   hommel     BY         holm       bonferroni
0.0028426  0.0142129  0.0142129  0.0142129  0.0324528  0.0142129  0.0142129
0.0155575  0.0194806  0.0215229  0.0215229  0.0444807  0.0609275  0.0777874
0.0152319  0.0194806  0.0215229  0.0215229  0.0444807  0.0609275  0.0761593
0.0155845  0.0194806  0.0215229  0.0215229  0.0444807  0.0609275  0.0779224
0.0215229  0.0215229  0.0215229  0.0215229  0.0491439  0.0609275  0.1076144
35.1.1.1.1 Bonferroni

The simplest to understand. Each unadjusted p-value, \(p\), is multiplied by the total number of comparisons made, \(m\), with the result capped at 1.

\[\begin{equation} p_{adjust} = m \times p \end{equation}\]

Use the Bonferroni correction when seeking strong protection against type1 error. Of all of these p-adjustment methods, only the Bonferroni includes a pathway to calculate confidence intervals that are coherent with the adjusted p-values.
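As a quick check, here is a minimal sketch reproducing the bonferroni column of the table above from its none column:

# unadjusted p-values from the "none" column of the table above
p <- c(0.0028426, 0.0155575, 0.0152319, 0.0155845, 0.0215229)
m <- length(p)

pmin(1, m * p)                      # Bonferroni by hand, capped at 1
p.adjust(p, method = "bonferroni")  # same values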

35.1.1.1.2 Sidak (unavailable)

Although not an option in R’s p.adjust.methods, it is easy enough to code. Sidak adjusts each p by the total number of comparisons, m, using the following relationship.

\[\begin{equation} p_{adjust}=1-(1-p)^m \end{equation}\]

Use the Sidak correction when you wish strong protection against type1 error, but perhaps not as strong as Bonferroni.
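A minimal hand-rolled sketch (the function name sidak is my own; p is the vector defined in the Bonferroni sketch above):

# Sidak adjustment: 1 - (1 - p)^m
sidak <- function(p, m = length(p)) 1 - (1 - p)^m

sidak(p)  # each value slightly smaller than the Bonferroni pmin(1, m * p)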

35.1.1.1.3 Holm

Selecting the "holm" option of the p.adjust function executes a step-based Bonferroni correction procedure, otherwise known as the Holm-Bonferroni. The procedure first ranks the unadjusted p-values from smallest to largest, then steps up through them. At each step \(i\), the Bonferroni multiplier is reduced to account for the comparisons already made, and the cumulative maximum is reported. This stepping is why several adjusted p-values in the table can share the same value.

\[\begin{equation} p_{adjust}=(m+1-i)\times p \end{equation}\]

Use the Holm correction when you wish strong protection against type1 error, but perhaps a bit less strong than the Bonferroni.
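To make the stepping concrete, a minimal sketch reproducing the holm column of the table above (the intermediate object names are my own):

# Holm by hand, using the unadjusted p vector from the Bonferroni sketch
m <- length(p)
i <- seq_len(m)
o <- order(p)                                  # rank smallest to largest
holm <- pmin(1, cummax((m + 1 - i) * p[o]))[order(o)]

all.equal(holm, p.adjust(p, method = "holm"))  # TRUE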

35.1.1.1.4 Hochberg

Selecting the "hochberg" option of the p.adjust function also executes a step-based Bonferroni correction procedure, but it differs markedly from the Holm procedure. The "hochberg" first ranks the unadjusted p-values from highest to lowest and, rather than returning cumulative maxima, returns cumulative minima. This leaves the highest original p-value unadjusted, while the lowest p-value receives the maximal Bonferroni adjustment.

The overall outcome of the "hochberg" is a liberal p-value adjustment on the input array compared to the Holm and Bonferroni.

\[\begin{equation} p_{adjust}=(m+1-i)\times p \end{equation}\]

Use the Hochberg correction on experiments where you want strong protection against type2 error. For example, when the experiment is more exploratory in design, closer to a hit screen than to a severe hypothesis test.
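For comparison with the Holm sketch above, a minimal hand-rolled Hochberg (same multipliers, opposite stepping direction):

# Hochberg by hand: step from the largest p-value down, report cumulative minima
m <- length(p)
i <- m:1
o <- order(p, decreasing = TRUE)
hoch <- pmin(1, cummin((m + 1 - i) * p[o]))[order(o)]

all.equal(hoch, p.adjust(p, method = "hochberg"))  # TRUE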

35.1.1.1.5 Hommel

Everybody seems to agree the Hommel procedure is the most difficult to describe. Even if I showed the formula here, I’m not sure I could describe it.1

As a general rule, I try to avoid using things I don’t understand.

As is evident from the result in the table of options, the Hommel adjustment performs along the lines of the Hochberg.

35.1.1.1.6 Benjamini-Hochberg (BH, FDR)

Selecting the "BH" option executes the same step-down procedure as the "hochberg", but with a different correction factor. Furthermore, the "BH" and "fdr" selections run the same calculation, yielding identical output. "BH" and "fdr" are one and the same in the table above and in this description.

To perform the step-down, the unadjusted p-values are ranked from highest to lowest and each is multiplied by the ratio of the total number of comparisons, \(m\), to its original low-to-high rank, \(i\). The net effect is that the highest p-value is uncorrected, the lowest p-value gets the full Bonferroni, and the others are corrected between these two extremes. Cumulative minima are reported in the output.

The net outcome is that the Benjamini-Hochberg procedure is even more liberal than the Hochberg.

\[\begin{equation} p_{adjust}=\frac{m}{i}\times p \end{equation}\]

Use the Benjamini-Hochberg correction on experiments where you want strong protection against type2 error. For example, when the experiment is designed to be exploratory, more like a hit screen than a severe hypothesis test. Indeed, you’ll find the BH/fdr correction used widely for hit detection in large “-omics” data sets.
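A minimal sketch of the BH step-down, reproducing the BH_FDR column of the table above:

# BH by hand: the multiplier is m/i rather than (m + 1 - i)
m <- length(p)
i <- m:1
o <- order(p, decreasing = TRUE)
bh <- pmin(1, cummin((m / i) * p[o]))[order(o)]

all.equal(bh, p.adjust(p, method = "BH"))  # TRUE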

35.1.1.1.7 Benjamini-Yekutieli

Selecting the "BY" method applies a Hochberg-like step-down p-value correction that controls type1 error considerably more strictly than the Benjamini-Hochberg (FDR) method.

\[\begin{equation} p_{adjust}=\left(\sum_{j=1}^{m}\frac{1}{j}\right)\times\frac{m}{i}\times p \end{equation}\]

This strikes an interesting balance: strong control of type1 error while still allowing a more liberal evasion of type2 error. It is probably a good choice for experiments whose objectives straddle severe hypothesis testing and hit screening.
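A minimal sketch showing that relationship: the BY values are the BH values scaled by the harmonic-sum constant (exact here because no value hits the cap of 1):

# harmonic-sum constant: 1 + 1/2 + ... + 1/m = 2.2833 for m = 5
cm <- sum(1 / seq_len(length(p)))

pmin(1, cm * p.adjust(p, method = "BH"))  # matches the BY column above
p.adjust(p, method = "BY")                # same values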

35.1.1.2 Examples

No p-value adjustment

Sometimes it’s useful to generate p-values that are not corrected for multiple comparisons. For example, when we wish to create a p-value array that we will then subset to focus on planned comparisons.

The script below generates a matrix of p-values for t tests of all possible comparisons, none of which are adjusted for multiple comparisons.

Parenthetically, these p-values are what would be observed if the Fisher LSD test were applied. The Fisher LSD test actually does not adjust p-values.

allPairs <- pairwise.t.test(chickwts$weight, chickwts$feed, paired=FALSE, pool.sd=TRUE, p.adjust.method="none")
allPairs
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  chickwts$weight and chickwts$feed 
## 
##           casein  horsebean linseed meatmeal soybean
## horsebean 2.1e-09 -         -       -        -      
## linseed   1.5e-05 0.01522   -       -        -      
## meatmeal  0.04557 7.5e-06   0.01348 -        -      
## soybean   0.00067 0.00032   0.20414 0.17255  -      
## sunflower 0.81249 8.2e-10   6.2e-06 0.02644  0.00030
## 
## P value adjustment method: none

The unadjusted p-value output from the pairwise.t.test function is a matrix, which is important to know when the need arises to pull out specific elements of the analysis.

class(allPairs$p.value)
## [1] "matrix" "array"
allPairs$p.value
##                 casein    horsebean      linseed   meatmeal      soybean
## horsebean 2.067997e-09           NA           NA         NA           NA
## linseed   1.493344e-05 1.522197e-02           NA         NA           NA
## meatmeal  4.556672e-02 7.478012e-06 1.347894e-02         NA           NA
## soybean   6.654079e-04 3.246269e-04 2.041446e-01 0.17255391           NA
## sunflower 8.124949e-01 8.203777e-10 6.211836e-06 0.02643548 0.0002980438

For example, to quickly scan which comparisons are below the p < 0.05 threshold we apply a simple custom extreme function across the matrix:

# returns TRUE for p-values below the 5% threshold
extreme <- function(x){
  x < 0.05
}

apply(allPairs$p.value, c(1, 2), extreme)
##           casein horsebean linseed meatmeal soybean
## horsebean   TRUE        NA      NA       NA      NA
## linseed     TRUE      TRUE      NA       NA      NA
## meatmeal    TRUE      TRUE    TRUE       NA      NA
## soybean     TRUE      TRUE   FALSE    FALSE      NA
## sunflower  FALSE      TRUE    TRUE     TRUE    TRUE

So we can easily see that without adjustment all but 3 of the 15 possible comparisons are statistically different.

Subsets of comparisons

Let’s imagine we’ve just started a new postdoc position in a lab that studies chick weights. To set up a new line of research that will span the next few years we need to make a decision on the best way to feed the chicks.

The standard chow protocol is casein. Are there any other chows that would be better?

The experiment is a one-way completely randomized ANOVA design that measures chick weight on 6 different diets including the casein diet. Let’s stipulate the experiment passes an omnibus F test. We now wish to conduct posthoc comparisons to test whether any of the 5 other diets differ from casein.

Here’s a three step procedure for doing just that.

Step 1: Run the pairwise.t.test function, setting the argument p.adjust.method="none". The output object, which we name allPairs, includes a matrix of p-values for all possible comparisons.

#just repeating from above
allPairs <- pairwise.t.test(chickwts$weight, chickwts$feed, alternative = "two.sided", p.adjust.method = "none")

Step 2: Select from the allPairs matrix only the p-values that correspond to the comparisons you’d like to make. Name that vector of unadjusted p-values selectPairs. This takes a bit of cleverness depending on what you want to grab from the matrix.

For example, we only want to compare all of the diets to casein. The comparisons we want are all in the first column. Use your matrix indexing skillz to grab only the unadjusted p-values from that first column:

selectPairs <- allPairs$p.value[, 1]
selectPairs
##    horsebean      linseed     meatmeal      soybean    sunflower 
## 2.067997e-09 1.493344e-05 4.556672e-02 6.654079e-04 8.124949e-01
selectPairs < 0.05
## horsebean   linseed  meatmeal   soybean sunflower 
##      TRUE      TRUE      TRUE      TRUE     FALSE

Step 3: Now pass the unadjusted p-values in the selectPairs vector into the p.adjust function. We choose the stringent Bonferroni method because our foreseeable future depends on the outcome of this test. We have low tolerance for type1 error.

The output of this step is a vector of adjusted p-values for the selected group of comparisons.

adjustedPvalues <- p.adjust(selectPairs, method="bonferroni")
adjustedPvalues
##    horsebean      linseed     meatmeal      soybean    sunflower 
## 1.033998e-08 7.466720e-05 2.278336e-01 3.327039e-03 1.000000e+00

Which of these are extreme? If it’s not clear by inspection (or too large), use a simple Boolean:

adjustedPvalues < 0.05
## horsebean   linseed  meatmeal   soybean sunflower 
##      TRUE      TRUE     FALSE      TRUE     FALSE

We reject the null that mean chick weights are the same between casein and each of horsebean, linseed, and soybean. Chick weights on meatmeal and sunflower diets show no statistical difference from casein.

How would we act on this information? That’s a scientific judgment. Perhaps the sunflower diet is less expensive, or we are from Kansas.

It is important to recognize that we have NOT run a test of equivalence. We cannot conclude that chick weights on casein, meatmeal and sunflower are equivalent. The statistical analysis only suggests that they don’t differ, given these sampling conditions. That sounds like the same thing but it is not. For example, if the sample size were larger we might observe a statistical difference.

Making all possible comparisons

Imagine that we were just exploring different feeds and we were interested in comparing all feeds to all other feeds.

bonf.adjustedAllpairs <- pairwise.t.test(chickwts$weight, chickwts$feed, alternative = "two.sided", p.adjust.method = "bonferroni")
bonf.adjustedAllpairs
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  chickwts$weight and chickwts$feed 
## 
##           casein  horsebean linseed meatmeal soybean
## horsebean 3.1e-08 -         -       -        -      
## linseed   0.00022 0.22833   -       -        -      
## meatmeal  0.68350 0.00011   0.20218 -        -      
## soybean   0.00998 0.00487   1.00000 1.00000  -      
## sunflower 1.00000 1.2e-08   9.3e-05 0.39653  0.00447
## 
## P value adjustment method: bonferroni

And here’s a quick scan for which of these are now below a FWER threshold of p < 0.05. Note how a handful of comparisons that were scored extreme by this simple test above, without a p-value adjustment, are no longer extreme with the adjustment.

# reusing the extreme() helper defined above
apply(bonf.adjustedAllpairs$p.value, c(1, 2), extreme)
##           casein horsebean linseed meatmeal soybean
## horsebean   TRUE        NA      NA       NA      NA
## linseed     TRUE     FALSE      NA       NA      NA
## meatmeal   FALSE      TRUE   FALSE       NA      NA
## soybean     TRUE      TRUE   FALSE    FALSE      NA
## sunflower  FALSE      TRUE    TRUE    FALSE    TRUE

In contrast to what we observed running this trick on the unadjusted p-values, with the Bonferroni we go from twelve statistically different comparisons to eight.

35.1.2 Range tests

These range tests are designed to compare group means. They do not test the means of differences between two levels of a factor, and group means are irrelevant in related measures designs.

They should only be used as posthoc tests for completely randomized designs. When the design is related measures, posthoc testing should be performed using the pairwise.t.test with p-value adjustments, as covered in the section above.
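For instance, here is a minimal sketch of the paired call pattern using R’s built-in sleep data (only two levels, so the adjustment is moot, but the arguments are the same for any number of levels):

# related measures: paired = TRUE, plus a p-value adjustment method
# (sleep is ordered by subject within each group, as paired tests require)
pairwise.t.test(sleep$extra, sleep$group, paired = TRUE,
                p.adjust.method = "holm")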

These tests differ in a fundamental way from the p-value adjustments associated with pairwise.t.test. They operate not on p-values but on the differences between group means. The p-values are derived from “Studentized” probability distributions of these differences.

Range tests first calculate a critical threshold for differences between group means, then compare every pairwise difference between group means to this critical difference. Statistically different comparisons are those whose differences exceed the critical difference.

The adjustment for multiple comparisons comes from the probability distribution of the statistic used to calculate these critical differences.

You’ll recognize this procedure as analogous to calculating confidence intervals, where a critical value of a test statistic defines the boundary of statistical extremeness.

For range tests, that boundary is called the critical difference. Any differences more extreme are ruled as statistically different.

35.1.2.1 How this works

An array of group mean differences can be calculated from a vector of group means.

Suppose we had a way of declaring some critical difference between group means. Any differences observed in our array above that critical difference value would be taken as statistically different.
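For example, a minimal sketch of such an array using the chickwts group means (the range test functions below automate all of this):

# all pairwise differences from a vector of group means
groupMeans <- tapply(chickwts$weight, chickwts$feed, mean)
diffs <- outer(groupMeans, groupMeans, "-")

# the k(k - 1)/2 = 15 unique differences for k = 6 groups
diffs[lower.tri(diffs)]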

Imagine a set of ANOVA data comprised of \(N\) total replicates. There are \(n\) replicate values in each of \(k\) groups. The largest group mean for the response is \(\bar y_{max}\) and the smallest is \(\bar y_{min}\). The pooled variance for all replicate values in the set is \(s^2\).

Then the random variable \[q =\frac{\bar y_{max}-\bar y_{min}}{\sqrt\frac{s^2}{n}}\]

has what is called a Studentized range distribution.

Given a confidence level \(1-\alpha\), the number of groups \(k\), and the residual error degrees of freedom \(df=N-k\), a critical value of \(q\) can be derived from an appropriate studentized distribution.

This critical \(q\) value is used to define the critical difference between group means in a data set where the residual variance is \(MS_{residual}\).

For example, the critical difference for the TukeyHSD test is: \[Tukey\ critical\ difference = q_{(conf.level, k, df)}\times\sqrt\frac{MS_{residual}}{n}\]

35.1.2.1.1 Example of Tukey by hand

Imagine a one-way ANOVA analysis with 5 groups, each containing 4 replicates. The \(MS_{residual}= 4.0333\) with \(df = 15\). From these values we can calculate a critical difference between group means as follows.

In the old days, we would find a table of Tukey’s HSD q values in the back of a statistics textbook. Today, R’s qtukey function calculates the value of \(q\) directly.

q <- qtukey(0.05, nmeans = 5, df = 15, lower.tail = FALSE); q
## [1] 4.366985

The critical difference for the TukeyHSD test would be

critdiff <- q*sqrt(4.0333/4); critdiff
## [1] 4.385125

With 5 groups there are \(k(k-1)/2=10\) differences between group means to calculate. For those differences that exceed the value \(critdiff = 4.385125\) we would reject the null that there is no difference between those group means.

35.1.2.1.2 Example doing Tukey with R

Let’s run through a one-way ANOVA on the chickwts data with a TukeyHSD posthoc as the follow-up.

We do this to produce an aov object that we can pass into the PostHocTest function.

chickwts$ID <- as.factor(1:nrow(chickwts))
my.ezaov <- ezANOVA(
            data = chickwts, 
            wid = ID, 
            dv = weight, 
            between = feed,
            type = 2, 
            return_aov = T, 
            detailed = T)
## Warning: Data is unbalanced (unequal N per group). Make sure you specified a
## well-considered value for the type argument to ezANOVA().
## Coefficient covariances computed by hccm()
# my.ezaov$ANOVA, this is a dataframe
# my.ezaov$Levene, this is also a dataframe
# my.ezaov$aov, this is an aov object that we can pass into posthoc functions. 

Now run the TukeyHSD test:

PostHocTest(my.ezaov$aov, conf.level = 0.95, method="hsd")
## 
##   Posthoc multiple comparisons of means : Tukey HSD 
##     95% family-wise confidence level
## 
## $feed
##                            diff      lwr.ci    upr.ci    pval    
## horsebean-casein    -163.383333 -232.346876 -94.41979 3.1e-08 ***
## linseed-casein      -104.833333 -170.587491 -39.07918 0.00021 ***
## meatmeal-casein      -46.674242 -113.906207  20.55772 0.33246    
## soybean-casein       -77.154762 -140.517054 -13.79247 0.00837 ** 
## sunflower-casein       5.333333  -60.420825  71.08749 0.99989    
## linseed-horsebean     58.550000  -10.413543 127.51354 0.14133    
## meatmeal-horsebean   116.709091   46.335105 187.08308 0.00011 ***
## soybean-horsebean     86.228571   19.541684 152.91546 0.00422 ** 
## sunflower-horsebean  168.716667   99.753124 237.68021 1.2e-08 ***
## meatmeal-linseed      58.159091   -9.072873 125.39106 0.12770    
## soybean-linseed       27.678571  -35.683721  91.04086 0.79329    
## sunflower-linseed    110.166667   44.412509 175.92082 8.8e-05 ***
## soybean-meatmeal     -30.480519  -95.375109  34.41407 0.73914    
## sunflower-meatmeal    52.007576  -15.224388 119.23954 0.22070    
## sunflower-soybean     82.488095   19.125803 145.85039 0.00388 ** 
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test calculates the measured difference between each pair of group means. Application of the range test formula to the data set yields a confidence interval for each comparison. Comparisons whose confidence intervals do not include zero (equivalently, whose differences exceed the critical difference) are flagged with asterisks as statistically different. Adjusted p-values are also produced.

We see that the TukeyHSD posthoc test flags eight group mean differences as extreme.

35.1.2.2 Dunnett’s test

This post hoc method differs from those above because it conducts multiple comparisons on just a subset of the group means: each test group mean is compared back to a single control group mean. For example, use Dunnett’s to compare each of a group of test means back to the negative control mean.

Fewer comparisons means the allowed FWER is not spread as thinly as with the other options. The following script is configured to compare the mean at each level of feed to the mean response to the horsebean feed (to illustrate how to define the control group).

Dunnett’s is nice because it gives you the effect size (diff) and the confidence interval limits for the difference, as well.

Note: diff = the difference between diet means for the compared groups.

The p-values are adjusted for multiple comparisons.

DunnettTest(weight ~ feed, control = "horsebean", data = chickwts)
## 
##   Dunnett's test for comparing several treatments with a control :  
##     95% family-wise confidence level
## 
## $horsebean
##                          diff     lwr.ci   upr.ci    pval    
## casein-horsebean    163.38333 103.203907 223.5628 5.8e-09 ***
## linseed-horsebean    58.55000  -1.629427 118.7294  0.0591 .  
## meatmeal-horsebean  116.70909  55.298874 178.1193 4.2e-05 ***
## soybean-horsebean    86.22857  28.035815 144.4213  0.0016 ** 
## sunflower-horsebean 168.71667 108.537240 228.8961 2.0e-09 ***
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

35.1.2.3 Which range test is best?

Just like for the p-value adjustment methods, the range tests operate over a range from fairly liberal to fairly conservative control over type1 error. The best one to use depends upon the overarching scope of the experiment. When it is exploratory and we want protection from type2 error we choose a liberal test. When testing a crucial hypothesis and needing protection from type1 error, we choose a more conservative test.

Fisher’s LSD: Only use the Fisher LSD after a positive ANOVA for a small, 3 group experiment.

A 3 group experiment has a positive control, a negative control, and the test group. A one-way ANOVA of the data yields an extreme F statistic and a null rejection. The ANOVA doesn’t tell us whether that’s due to the positive control or due to the test group yielding a good response.

In fact, there is only one scientific question of interest: Is the test group different from the negative control? In this case, the Fisher LSD is used as a posthoc to answer that question under the type1 error already expended in the ANOVA test. A p-value adjustment is unnecessary because the ANOVA itself serves to “protect” the FWER.
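A minimal sketch (reusing chickwts purely for illustration; a real LSD application would involve just three groups): since the LSD makes no adjustment, its p-values are the unadjusted pairwise.t.test p-values, from which we read off the single planned comparison.

# Fisher LSD p-values are simply unadjusted pairwise t-test p-values
lsd <- pairwise.t.test(chickwts$weight, chickwts$feed,
                       p.adjust.method = "none")

# read off the one comparison of interest, e.g., test group vs negative control
lsd$p.value["linseed", "casein"]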

Newman-Keuls: The most liberal of range tests that do make adjustments. This should be used in exploratory experiments seeking potential hits, where type1 error is less of a concern.

Duncan and TukeyHSD: These tests offer moderate protection from type1 and type2 error, sitting between the liberal and conservative extremes. As you can see below, the Duncan is the more liberal of the two.

Scheffe and Dunn: Almost has the ring of a Nashville recording duo, doesn’t it? Or a trendy Beltline tapas place. These offer the most protection against type1 error, at the cost of surely missing some “hits.” These two tests should be avoided when the experiment is more exploratory in nature.

Table: Adjusted p-values from range tests applied to the chickwts data; left to right, liberal to conservative.

contrast             LSD        Newman_Keuls  Duncan     TukeyHSD   Scheffe    Dunn
horsebean-casein     0.0000000  0.0000000     0.0000000  0.0000000  0.0000006  0.0000000
linseed-casein       0.0000149  0.0000866     0.0000289  0.0002100  0.0016835  0.0002240
meatmeal-casein      0.0455667  0.0455667     0.0455667  0.3324584  0.5322842  0.6835008
soybean-casein       0.0006654  0.0019010     0.0009509  0.0083653  0.0356963  0.0099811
sunflower-casein     0.8124949  0.8124949     0.8124949  0.9998902  0.9999578  1.0000000
linseed-horsebean    0.0152220  0.0152220     0.0152220  0.1413329  0.2994229  0.2283296
meatmeal-horsebean   0.0000075  0.0000436     0.0000145  0.0001062  0.0009357  0.0001122
soybean-horsebean    0.0003246  0.0009361     0.0004681  0.0042167  0.0206107  0.0048694
sunflower-horsebean  0.0000000  0.0000000     0.0000000  0.0000000  0.0000003  0.0000000
meatmeal-linseed     0.0134789  0.0354947     0.0179077  0.1276965  0.2788785  0.2021841
soybean-linseed      0.2041446  0.2041446     0.2041446  0.7932853  0.8936535  1.0000000
sunflower-linseed    0.0000062  0.0000597     0.0000149  0.0000884  0.0007984  0.0000932
soybean-meatmeal     0.1725539  0.1725539     0.1725539  0.7391356  0.8604074  1.0000000
sunflower-meatmeal   0.0264355  0.0670935     0.0341292  0.2206962  0.4064409  0.3965322
sunflower-soybean    0.0002980  0.0016597     0.0005535  0.0038845  0.0192857  0.0044707

35.2 Reporting the result

It is imperative to state how FWER has been controlled when performing multiple comparisons as an ANOVA follow up.

A surprising number of papers describe analyzing data using ANOVA, then make many group comparisons and speckle their figures with asterisks, without ever mentioning the posthoc procedures. Did they do any? What exactly are these p-values and what exactly do the asterisks represent?

Somewhere we MUST write, “A [some way] ANOVA was followed by [posthoc range test or p-value adjustment name here].”

If true, when describing a set of p-values or confidence intervals in a figure or table legend, always use the phrase, “adjusted for multiple comparisons.” This makes it clear that we are reporting adjusted, rather than unadjusted p-values.

A very common mistake occurs when someone runs the posthoc but acts upon the unadjusted, rather than the adjusted, p-values. This is probably because they find posthoc testing confusing.

35.3 Summary

  • Yes, posthoc testing is confusing, given the many options and lots of output.
  • Everybody agrees it is wise to control FWER, but nobody agrees on how best to do it.
  • There are two basic procedures: p-value adjustments or range tests.
  • Range tests compare group means, and group means are irrelevant in related measures designs. Don’t use range tests on related measures designs.
  • Otherwise, before choosing an option, ask yourself: Is this exploratory or is this a severe test?
  • For exploratory experiments, choose the liberal procedures. They are the ones on the left in the tables above.
  • For severe tests, choose conservative procedures. They are the ones on the right in the tables above.

  1. Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75(2), 383–386.