Bioinformatics Asked by yuval on July 1, 2021
On GISAID they classified the coronavirus using 4 clades(S, G, V, Other).
I downloaded around 1,000 complete genomes of the coronavirus from GISAID and I would like to classify each one as belonging to one of the 4 clades(S, G, V, Other).
On the top left of the image below is the classification of the 3 clades(S, G, V).
More specifically, the three strains were classified based on three specific variants(This answer was taken from my previous question):
My problem is that using this method of classification for the coronavirus genome I am seeking information on the protein products of the genes in the genome.
So to be more specific here are my concise questions:
It seems to be explained right there in the image you posted:
So, the three strains were classified based on three specific variants:
Answered by terdon on July 1, 2021
The most important thing to say is we are grateful this information has been released. It is important to note these are very large genomes of around 9000 amino acids in total.
The clades are assigned to single amino acid mutations. The tree is built using nucleotide data therefore synonymous (silent) mutations are resulting in phylogenetic resolution and the authors have superimposed amino acid mutations over this tree structure. They have represented the tree as a circular phylogram.
So the reason why there is loads of branches (of a phylogenetic tree) of different sizes is because there are different numbers of silent (mostly 3rd codon wobble) mutations between the isolates. RNA viruses mutate rapidly. There are only 4+ amino acid changes across the entire epidemic according to this analysis.
Nomenclature The clades are named after the amino acid mutation that arises within that group. Its a bit weird, but its okay. Hence S clade means a serine mutation, G clade a glycine mutation etc.... The thing is that S also equals Spike, which is a very important protein (below), so it is confusing because the S clade refers to ORF8. Which gene the amino acid mutation is occurring in is more important than the specific mutation. Secondly the "other" clade is dying out - which is not that informative, but others are holding their course.
Mutation scheme and protein So @terdon is right
Epidemiological interpretation
Epidemiological interpretation is difficult because it doesn't really explain the possible differences in mortality rates between different countries, for example USA appears to have a low mortality rate in comparison to Italy but there is lots of mixing. It is also not easy to read the tree because it is a circular phylogram and I am struggling to read each "clade" (which has a precise phylogenetic diffinition) I have to take their word for it. Secondly is the colour scheme the 'red' swamps every other colour and gives a disproportionate red colour. This is an old epidemiologists trick in heat maps, there precise colour scheme albeit that is relative, appears to convey a "danger" message to the reader, even though mathematically its simply a consequence of the colour scheme.
Fundamentally it is not really clear whether the mutations confer phenotypic changes and these are huge genomes for an RNA virus. However, the G clade needs further investigation.
Summary The battle continues as to whether data collection skews, social skews, age-prevelence skews, climatic skews result in variation of mortality rates between countries? Versus is their a genetic difference in mortality. On this data I'd go for the non-genetic factors.
The answer to your second question is simple, you just translate the triplet codon into amino acids and identify which amino acid is present at the above 3 positions. You then simply denote the clade within the taxon ID of your nucleotide data. MEGA X has an alignment editor that will allow this translation.
Answered by M__ on July 1, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP