Bioinformatics Asked by L R Joshi on November 1, 2020
I am analyzing viral metagenomics data (Illumina MiSeq) for the first time. I have used Ray for de novo viral genome assembly before, but I haven't done metagenomic analysis.
I know there are dedicated metagenomics tools such as MetaVelvet, but I still need to find time to learn them.
For now, can I do a quick and dirty analysis using the following method?
1) De novo assembly using Ray.
2) Create a personal database of all the viral reference genomes from the NCBI database.
3) BLAST all the contigs generated by Ray against the personal database.
I assume this should give me some idea of which viral genomes are present in my reads. What are the potential flaws of this method compared with tools like MetaVelvet that are dedicated to metagenomics?
The specifics will depend on the experimental setup and data, but a few general comments...
1) Metagenomic data is often large in volume and high in microbial diversity but low and uneven in coverage. Single-genome assemblers may suffer particularly, since abundant (or closely related) species/strains will tend to be interpreted as repeats within one genome rather than as sequences from different genomes, leading to assembly errors. Check out CAMI (Critical Assessment of Metagenome Interpretation) for some recent benchmarks of metagenomic assemblers.
2) You might want to think about filtering your database to certain viruses, e.g. you don't expect to find RNA viruses in DNA sequencing. If the sample is an animal body site (like human gut) then consider restricting to viruses found in that organism (if it's an environmental sample this may not be possible).
3) BLAST will be slow; consider DIAMOND as a faster drop-in replacement for BLASTP (it searches protein databases, so nucleotide contigs would go through its translated blastx mode). Even faster approaches use only specific marker genes (like MetaPhlAn2 or MOCAT2) or k-mer methods (like Kraken or CLARK). Again, CAMI benchmarks many of these and other metagenomic taxonomic classifiers.
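To give a feel for why k-mer methods are fast, here is a minimal Python sketch of the general idea: index every k-mer of each reference once, then classify a read by dictionary lookups instead of alignment. This is a toy illustration only, not how Kraken or CLARK are actually implemented (Kraken, for example, assigns k-mers to a lowest common ancestor in a taxonomy). The sequences, the names `virus_A`/`virus_B`, and the choice of k=8 are made up for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify(read, index, k):
    """Return the reference with the most uniquely matching k-mers, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        owners = index.get(kmer, ())
        if len(owners) == 1:  # skip k-mers shared between references
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy, made-up sequences (assumption: real references would come from FASTA files)
references = {
    "virus_A": "ATGCGTACGTTAGCCGATAGCTAGGCTA",
    "virus_B": "TTGACCGGTAACGTCCATGGATCCAAGT",
}
index = build_index(references, k=8)

print(classify("CGTACGTTAGCCGATAG", index, k=8))  # read taken from virus_A
print(classify("GGGGGGGGGGGGGGGG", index, k=8))   # no matching k-mers
```

Each read costs only a handful of hash lookups, which is where the speed comes from; the trade-off is that divergent viruses with few exact k-mer matches can be missed.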
Otherwise, I think a possible missing step is filtering out the non-viral sequences. You haven't said whether your sample was enriched for viruses or depleted of bacteria/host prior to sequencing, but if it wasn't, then the vast majority of your sequences will likely be non-viral (e.g. bacterial or host). Removing or marking these early can speed up downstream analysis and remove false positives (e.g. the human genome contains integrated viruses, bacteria contain prophages, etc.). It is also normally important to look for and remove potential contaminants such as lab reagents or cloning vectors.
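One way to mark non-viral contigs, sketched below in Python, is to BLAST against a database that also contains host and vector sequences and then keep only the best hit per contig. The sketch assumes BLAST+ tabular output (`-outfmt 6`) with its default 12 columns. The identity and length thresholds, the subject IDs (`host_chr1`, `viral_ref_1`, ...), and the function names are hypothetical and would need adapting to real data.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, min_identity=80.0, min_length=100):
    """Return {contig: row} with the highest-bitscore hit that passes the filters."""
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, fields))
        if float(row["pident"]) < min_identity or int(row["length"]) < min_length:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


def triage(best, non_viral_ids):
    """Split contigs into viral candidates and likely host/contaminant contigs."""
    viral, flagged = [], []
    for contig, row in best.items():
        (flagged if row["sseqid"] in non_viral_ids else viral).append(contig)
    return sorted(viral), sorted(flagged)


# Made-up hits standing in for a real BLAST output file
rows = [
    ["contig_1", "viral_ref_1", "95.0", "500", "25", "0", "1", "500", "1", "500", "1e-150", "800"],
    ["contig_1", "host_chr1", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-30", "150"],
    ["contig_2", "host_chr1", "99.0", "1000", "10", "0", "1", "1000", "1", "1000", "0.0", "1800"],
    ["contig_2", "viral_ref_2", "90.0", "300", "30", "0", "1", "300", "1", "300", "1e-80", "400"],
    ["contig_3", "viral_ref_2", "70.0", "400", "120", "0", "1", "400", "1", "400", "1e-20", "200"],
]
blast_text = "\n".join("\t".join(r) for r in rows)

best = best_hits(io.StringIO(blast_text))
viral, flagged = triage(best, non_viral_ids={"host_chr1", "vector_pUC"})
print("viral candidates:", viral)
print("host/contaminant:", flagged)
```

Here `contig_2` matches a viral reference but matches the host better, so it is flagged rather than reported as viral, which is the kind of false positive (integrated virus, prophage) described above. `contig_3` is dropped entirely because its only hit falls below the identity threshold.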
Correct answer by Chris_Rands on November 1, 2020