Cross Validated Asked on December 29, 2021
I just did a series of 40 t-tests and then used Bonferroni correction for multiple testing, and my p-values got smaller. Why does this happen? It was my impression that multiple-testing correction would always result in an increase in p-value size.
Before correction:
substrate p_value
19 glycan 0.000139091904182433
23 dermatan sulfate 0.000139091904182433
4 chitin 0.000294435140367691
22 xylose 0.00387472305660014
5 beta-glucan 0.00552400891530821
2 cellulose 0.0130881279666714
After correction:
substrate p_value
10 glucose 5.110415e-21
29 fructose 1.745709e-20
26 lignin 7.090204e-18
30 cyclomaltodextrin 3.569263e-10
31 lacto-N-tetraose 3.569263e-10
32 hyaluronate 3.569263e-10
Code used to generate the P values:
# loop labeling one substrate as "A" and every other substrate as "B", then doing a t-test between the counts
sub_pvals = NULL
for(sub in unique(cazy_cata_melt$Substrate)){
  df = cazy_cata_melt
  df[df$Substrate != sub,]$Substrate = "B"
  df[df$Substrate == sub,]$Substrate = "A"
  input = cbind(substrate = sub, p_value = t.test(value ~ Substrate, data = df)[[3]][1])
  sub_pvals = rbind.data.frame(sub_pvals, input)
}
#correction for multiple testing
sub_pvals$p_value = p.adjust(sub_pvals$p_value, method = "bonferroni", n = length(unique(cazy_cata_melt$Substrate)))
#ordering the dataframe
sub_pvals = sub_pvals[order(sub_pvals$p_value),]
Data available here: https://pastebin.com/vsbYGkQW
The underlying data are counts associated with each of a set of 40 substrates. So, putting aside the coding problem (which isn't really on-topic here), the approach has two problems: t-tests aren't appropriate for count data, and serially comparing each substrate against the mean of all the other substrates pooled together isn't a correct way to do these tests.
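As an aside, your intuition about the correction itself is right: Bonferroni adjustment as done by p.adjust() just multiplies each p-value by n (capping the result at 1), so an adjusted p-value can never be smaller than the raw one. A quick check using a few of the raw p-values from your table:

```r
# Bonferroni multiplies each p-value by n, capped at 1,
# so adjusted values are never smaller than the raw ones.
p_raw <- c(0.000139, 0.0039, 0.013)
p.adjust(p_raw, method = "bonferroni", n = 40)
#> [1] 0.00556 0.15600 0.52000
```

That the "corrected" values came out smaller (and attached to different substrates) points to the coding problem noted above, not to the correction method.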
Count data are best analyzed with Poisson or negative-binomial models, for example via the glm() function in R. In your case that would be set up similarly to an ANOVA, coding the substrates as levels of a single categorical predictor. The analysis would then use an underlying error distribution (needed to assess the significance of any differences) appropriate to count data, for which the normal distribution assumed by ANOVA and t-tests doesn't hold.
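A minimal sketch of that setup, assuming the cazy_cata_melt data frame from your code, with a numeric count column value and a categorical column Substrate:

```r
# Sketch only: assumes the questioner's cazy_cata_melt data frame,
# with counts in `value` and substrate labels in `Substrate`.
cazy_cata_melt$Substrate <- factor(cazy_cata_melt$Substrate)

# Poisson GLM: one categorical predictor, like a one-way ANOVA
# but with an error distribution appropriate to counts.
fit <- glm(value ~ Substrate, data = cazy_cata_melt, family = poisson)
summary(fit)

# If the counts are overdispersed (common in practice), a
# negative-binomial model from MASS is usually a better choice:
# library(MASS)
# fit <- glm.nb(value ~ Substrate, data = cazy_cata_melt)
```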
You start with the significance of the overall model. If the model as a whole isn't significant you stop and don't proceed to individual comparisons. If the model is significant overall, there are much better (and more powerful) ways to examine differences among the individual substrates; see this answer for an example with count data.
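A hedged sketch of that workflow, again assuming the data frame from the question: the overall test can be done as a likelihood-ratio test against an intercept-only model, and the follow-up comparisons with, for example, the emmeans package, which handles the multiplicity adjustment for you.

```r
# Sketch only: assumes cazy_cata_melt with `value` counts and a
# `Substrate` factor, as in the question's code.
fit  <- glm(value ~ Substrate, data = cazy_cata_melt, family = poisson)
fit0 <- glm(value ~ 1,         data = cazy_cata_melt, family = poisson)

# Overall significance of the Substrate effect (likelihood-ratio test).
anova(fit0, fit, test = "Chisq")

# Only if the overall test is significant, examine individual
# substrates, e.g. with emmeans (one option among several):
# library(emmeans)
# emmeans(fit, pairwise ~ Substrate)
```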
Answered by EdM on December 29, 2021