Regarding canonical analysis of principal coordinates (CAP) for linking microbial data with elements

Question

I used the below code to do CAP analysis to link microbial data with elements, but not sure how to do statistical analysis to show which element variation is responsible most for the difference in microbial composition. I can only able to see two axis in figure so how I can get complete information about the variation based on these mineral element and individually like which mineral responsible most for microbial diversity?
    # CAP ordinate
    cap_ord <- ordinate(
        physeq = phyloseq_prop, 
        method = "CAP",
        distance = bray_not_na,
        formula = ~ ~ P + Mn + Cu +Zn + Rb + Mg )

# CAP plot
    cap_plot <- plot_ordination(
      physeq =phyloseq_prop , 
      ordination = cap_ord, 
        color = "SampleType", 
        axes = c(1,2)
    ) + 
        aes(shape = Region) + 
        geom_point(aes(colour = SampleType), alpha = 0.4, size = 4) + 
        geom_point(colour = "grey90", size = 1.5) + 
        scale_color_manual(values = c("#CBD588", "#5F7FC7", "orange","#DA5724", "#508578", "#CD9BCD",
       "#AD6F3B", "#673770","#D14285","#652926", "#C84248", 
      "#8569D5")
        )
        
        # Now add the environmental variables as arrows
    arrowmat <- vegan::scores(cap_ord, display = "bp")
    
    # Add labels, make a data.frame
    arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
    
    # Define the arrow aesthetic mapping
    arrow_map <- aes(xend = CAP1, 
        yend = CAP2, 
        x = 0, 
        y = 0, 
        shape = NULL, 
        color = NULL, 
        label = labels)
    
    label_map <- aes(x = 1.3 * CAP1, 
        y = 1.3 * CAP2, 
        shape = NULL, 
        color = NULL, 
        label = labels)
    
    arrowhead = arrow(length = unit(0.02, "npc"))
    
    pdf("cap2.pdf", height=8, width=8)
    # Make a new graphic
    cap_plot + 
      geom_segment(
        mapping = arrow_map, 
        size = .5, 
        data = arrowdf, 
        color = "gray", 
        arrow = arrowhead
      ) + 
      geom_text(
        mapping = label_map, 
        size = 4,  
        data = arrowdf, 
        show.legend = FALSE
      )
      
      dev.off();

cap_ord
    
     Call: capscale(formula = distance ~  P + Mn + Cu +Zn + Rb + Mg, data = data)

Inertia Proportion Rank
Total          2.069e+01  1.000e+00
Constrained    5.240e+00  2.533e-01   10
Unconstrained  1.545e+01  7.468e-01   97
Imaginary     -1.090e-03 -5.268e-05    1
Inertia is squared Bray distance

Eigenvalues for constrained axes:
  CAP1   CAP2   CAP3   CAP4   CAP5   CAP6   CAP7   CAP8   CAP9  CAP10
2.5733 0.8789 0.5647 0.3797 0.3037 0.1873 0.1243 0.1030 0.0800 0.0453

Eigenvalues for unconstrained axes:
 MDS1  MDS2  MDS3  MDS4  MDS5  MDS6  MDS7  MDS8
4.790 1.343 1.138 0.916 0.784 0.649 0.518 0.501
(Showing 8 of 97 unconstrained eigenvalues)

Many thanks

Maximilian Press · Accepted Answer

I don't think that you can say that the 25% can be said to relate to the chemical variable specifically; rather, that 25% is the variance that is explained by all constrained axes. While these axes are drawn in such a way that they maximize variance associated with your independent variables, I do not think that they are directly interpretable as being 100% due to the elements.
The number of constrained axes is a parameter chosen as an input to the CAPscale method:

An extremely important point is to determine how many PCO axes should be retained; i.e., what should be the choice for $m$? One might be concerned that the method could suffer from a poor choice for $m$. If $m$ is too small, then there may be some ecologically important information in the data that will not be included in the canonical analysis. On the other hand, if $m$ is too large relative to $N$, then a misleading canonical plot could result. For example, we cannot set $m = N - 1$, for then the squared canonical correlations would all be equal to 1.0, because we have $N$ points!

So the value of the variance there will vary depending on how many constrained axes you use. If I understand right, you could obtain 100% distance variance explained by choosing as many constrained axes as you have observations.
I think that it is worth looking into how others have solved this problem. For example, one has to first decide whether or not this 25% is an unexpectedly large number, i.e. compute  p-values (I am not crazy about p-values in most contexts but here they could serve as a helpful heuristic), as suggested by the CAPscale paper:

As a further option, one may test the hypothesis of either (1) no significant differences in multivariate location among groups (CDA), or (2) no significant relationship with quantitative environmental or other variables (CCorA). This is done by using the trace statistic (sum of canonical eigenvalues (sum of squared canonical correlations; see the Appendix) and obtaining a P value by permutation (e.g., Anderson 2001b).

This seems to be directly implemented as the anova.cca() function in the vegan package (I believe that they have decided that CAP is a flavor of CCA for that package). I understand the motivation to get % variance as it's intuitive, but in complex statistical analysis of distances it is a little bit less clear what it really means. Some kind of comparison to a null hypothesis is probably necessary to say whether it is meaningful.
For how others have approached similar questions you can see this discussion. For other implementations that may be more user-friendly, you might consider the vegan package, which is a richer toolbox for ordination in ecology (even if it is less focused on metagenomics). Here is the CAPscale function in vegan. The phyloseq authors have correctly decided to build large pieces of phyloseq out of vegan, so I think that you can pass phyloseq back into vegan functions and e.g. use those analyses directly on your cap_ord object. But see that link for more info.
For a little more easy-to-understand background on statistics of distance methods (particularly in microbiomes), you might see here.

Regarding canonical analysis of principal coordinates (CAP) for linking microbial data with elements

One Answer

Add your own answers!

Ask a Question