Bioinformatics Asked on January 18, 2021
I’m a computer scientist teaching algorithms development in the Fall. One of the algorithms we teach is called Edit Distance, and our folklore is that it is used to compare RNA sequences (is this actually true in practice?).
I would like to have students implement the edit distance algorithm and run it on actual SARS-COV-2 sequences, so I’m trying to understand exactly what I get from the GenBank database. I downloaded this one: https://www.ncbi.nlm.nih.gov/nuccore/1798174254
I am looking at the genomic.fna file. So this is apparently the FASTA file format, and the lines that begin with >MN988669.1 … are comments. I see comments like:
>MN988669.1 Severe acute respiratory syndrome coronavirus 2 isolate 2019-nCoV WHU02, complete genome
Followed by an RNA string. Is this the start of a new sequence for a different coronavirus specimen? So I could have students extract each of these and run edit distance and then produce a dendrogram or something? How do I find more information on where the samples came from? Is this the right file to be using, or should I use the gbff file? And are the PDB files interesting to me at all (I actually do know what PDB files are)?
Also, are there any recommended data sets where we could do something like track mutations in the virus (and see, e.g., that the NYC outbreak originated from Europe and not China)? Are there other useful algorithms / data that might be interesting for students to study in this vein? Specifically interesting to me would be graph-search algorithms, minimum spanning trees, and network flow. Also any NP-complete algorithms that we could run backtracking on. Obviously taking the theoretical study of algorithms to something as currently topical as the coronavirus has pedagogical value.
Thanks
EDIT:
Based on comments below here is what is taking shape.
My learning objectives (as they are now shaping up) are:
I would very much appreciate thoughts on the above.
Your understanding of FASTA format is about right. The type of basic problem you're eluding to we term "sequence alignment"- edit distance might be okay for teaching but in practise we use other algorithms, e.g. you might be interested in the Needleman–Wunsch or Smith–Waterman algorithms. Richard Durbin et al. wrote a great book that covers these any much more https://www.amazon.com/Biological-Sequence-Analysis-Probabilistic-Proteins/dp/0521629713
Tracking mutations etc. requires more than just alignment though, see "phylogenetics" (i.e. building genetic trees) and "variant calling". Also checkout what the nextstrain team is doing https://nextstrain.org/ncov/global
In general, looking for practical applications for your algorithms is great, but be very careful before drawing any real world conclusions about the coronavirus outbreak from such analyses
Answered by Chris_Rands on January 18, 2021
The correct method is laborious and it is better to give the students a ready made tree at GISAID to investigate the dispersion of COVID-19 into Europe.
However, a quick approach to getting to the alignment and drawing a tree is easy and will readily complement your established teaching method. What this will give you is a very different phylogeny from edit distances
and you would explain the matrix differences between the approaches. I suspect NCBI use Jukes Cantor distances.
I have provided an example below, which I "rerooted" and represents the closest 100 to you sequence using collapsed clade format (your students can undo this, to examine the contents of a "collapsed clade"). The tree shows the dispersion of strains from the Wuhan seafood market.
There's lots of flexibility and the students can do everything inside ½ hour easily and this would complement your approach. The advantage of my approach to teaching phylogeny and regardless of what you do and how you do it Blast is central to rapidly obtaining alignment data for students and investigators alike. We use different blast options but blasting through is prerequisite to understand the diversity and clean some information on the population structure.
Also, are there any recommended data sets where we could do something like track mutations in the virus (and see, e.g., that the NYC outbreak originated from Europe and not China)?
Yes there are, today this would be Mesquite available here. In my opinion this is a bit advanced. You can track amino acid mutations, which in my view is easier.
Answered by M__ on January 18, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP