Bioinformatics Asked by Darío Rocha on November 5, 2020
I have recently ran in R four geneset enrichment analysis in the same database (TCGA:breast cancer) comparing two intrinsic subtypes. The methods I used were:
The genesets tested are the full combined list of KEGG, GO:BP, GO:MF and GO:CC. In all cases I trimmed the genesets to genesets of length <500, which leaves a total of 21723 genesets tested.
Sadly, results are confusing. For starters mGSZ for some reason is cropping the output and giving results for only around 9700 of those genesets -I’ve already sent an email to the package mainteiner asking about that. But that isn’t the biggest problem, the most shocking thing is that none of the analyses performedt has a good correlation to the results of any of the other analyses. Here I plotted the -log10(adjusted p-value + 0.0001) of each analysis agaisnt every other analysis, with lines indicating -log10(0.05)=1.3 and -log10(0.01)=2.
I was expecting the results to be somehow different, but here I can’t see even something remotely close to a correlation between the different results. It is also evident that MIGSA and GSVA+limma output a lot more significatively enriched genesets than the other two methods.
I have gone through the code time and again, and I can’t find anything errors. Also I have searched for studies doing this kind of direct comparison between methods, without luck so far. Has anybody experienced this kind of inconsistencies between gene set enrichment analysis results with different tools? What could I possible be doing wrong?.
Each of these methods do something different, so it is reasonable to expect different results. The bottom line is that there isn't a single question you can ask when you do an "enrichment test". For example: you can test if a set of genes are more expressed than this other list of genes or they are above the mean levels of any other gene.
Even more, you use the gene ontology, which has an underlying structure (directed acyclic graph) that no method you used takes into account but could modify the results if taken into account. However, the KEGG pathways do not have this structure. So when you want to apply a method mixing both data sources you need to decide if you want to use it or not.
This is also complicated because for most of these methods you can only compare the p-value which depends on the underlying assumptions of the methods (uniform, normal, binomial distribution?)
The latest comprehensive study comparing enrichment methods including some of the methods you used and more is this one (that I know of): Toward a gold standard for benchmarking gene set enrichment analysis.
Last, what would a correlation tell you here? It is know that there is little agreement between methods. You couldn't say that one method is equal to the other unless you tested with several datasets and you had always a correspondence on the p-value.
Correct answer by llrs on November 5, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP