Bioinformatics Asked on December 14, 2020
I am posting this message because I need clarity about the analysis I want to perform.
In my analysis, I have a homologous gene in 13 species, and I would like to evaluate the selection pressure of this gene (its dN/dS) in 3 species of the tree and test if this value is significantly different from the neutral model (dN/dS = 1).
So here is the protocol I am considering, and I would like you to tell me if I am right or not:
To calculate this selection pressure, the ratio of non-synonymous to synonymous substitution rates (dN/dS) must be calculated.
For this I use CODEML implemented in ete3.
So I need :
A phylogenetic tree of the gene (performed with the AA alignment of the gene via Iqtree)
A codon alignment (performed with the DNA alignment of the gene with pal2nal)
Mark branches in the tree.
Here is the exemple where I want to get the dN/dS value of SP5,6 and 7.:
| | /-|
| | | -SP4
| | /-SP5
/-| - | /-SP6
| | -|
| | |
| | -SP7
| | /-SP8
| | -SP11
And I want to calculate the dN/dS of
SP7 in order to confirme the hypotesis that the gene is under strong purifying selection on this 3 species.
I mark the SP5, SP6 and SP7, run a
free-branch model and look at the Codeml output (the dN/dS value should be the same for the 3 marked species right?) Then this value is kind of the mean dN/dS value between three leafs?
Suppose I get an omega = 0.15
If I want to test that this value is significantly different from 1, I need to mark the same sheets and run the neutral model, then compare the two models (
free) using chi2, and if the p-value is
<0.5, then the dN/dS value (0.15) is significantly different from 1.
I’m right ?
The following is not the clearest explanation of dN/dS humankind has seen, but I hope you get the gist and core idea of likelihood is absolutely correct. I am wondering if you are thinking of a different test to the way I would do this calculation. This ain't a text book clarity answer.
Anyway what you are proposing is a likelihood ratio test (LRT) for selection based on the parameterisation of omega. Your idea isn't quite correct ... its not dN/dS value you are looking at to assess significance, its the likelihood. This point is critical and is the most important difference in our thinking on this test.
LRT works because its a log likelihood and subtracting two log likelihood is the same as dividing them (I forget the theory - but you subtract the likelihood to work out the p-value via Chi squared). There's some theory about why the Chi-squared distribution works but I can't remember it.
You need to work out the degrees of freedom which is the parameter differences between 'free branch model' and the neutral model and this value is LOW. If the neutral model is the "global model", then its 2 degrees of freedom - IN YOUR TREE. This is because the SP you are looking at form a single clade (monophyly). From memory you stick a #1 on the branch leading into the clade, ie. defining the node and thats your 'model' on the treefile. Here you have two dN/dS one for the whole tree excluding your clade and one for your taxa. This gives you a likelihood.
Note you could stick a #1 on each branch within the clade, i.e. the treefile. That means something different - like every branch was subject to ... in this case purifying selection ... not just the common ancestor.
You now work out the likelihood of the 'global' model... i.e. remove the #1 and make sure it gives you only 1 dN/dS value.
Subtract the likelihoods and multiple by two - if its LRT (super important).
You now need to work out the degrees of freedom. If its the same model but one has an additional node with a different dN/dS ... the df = 2. If you've dumped a #1 on every branch in the clade ... I still think df = 2 - you need to check.
If the difference in likelihood results in p < 0.05 you are correct the difference is significant ... whatever that difference might be.
If the neutral model is some different calculation of omega within the population genetics process (see here) then it will be the parameter differences between your model and the 'neutral' model.
To reiterate I like simulations so much I type in reiterations ... So again the LRT usual will usually subtract one likelihood from another and THEM multiply the probability by two ... and compare the result on the Chi squared table for the difference in degrees of freedom. And again what you are looking at is not the value of dN/dS but the difference in LIKELIHOOD. That is very important. You need to be careful with the degrees of freedom on an LRT because they can introduce an additional subtraction constant, test dependings (its ages since I've done one so I can't remember).
If the test is significant and one dN/dS is x for the clade and y for the rest of the tree, then that differences is significant and if y is >1 (bit unlikely) and x < 1 then sure thats purifying selection. Usually clades evolve via positive selection rather than purifying selection.
From memory that is my understanding ... but if you have a neutral model of omega and a purifying model of omega ... yeah sort of, but usually thats for a 'global' model, i.e. no differences within the tree. In your case your interested in a specific clade. The idea is still the same its measured through likelihood.
Its so long since I've done this, but I do caution not to confuse two different tests.
I do understand Yang et al did an MCMC for dN/dS ... what I'm thinking of here is maximum likelihood. You'd need to watch for that carefully and thats what you are referring to when you say Chi-squared, because that is LRT 100%. LRT is maximum likelihood.
Correct answer by M__ on December 14, 2020
Get help from others!