Bioinformatics Asked on October 6, 2021
I guess the title is self-explanatory. I’d like to find the reference transcriptome files for a few human viruses. My RNA-seq samples are from human tumor cells that were infected with viruses. The main part of my analysis is quantifying human gene expression from these samples. However, I would also like to find out what proportion of my raw data is originated from virus RNA.
So, I’d like to learn how to find the reference transcriptome data for human viruses. I would also appreciate some insight into your workflow for approaching this problem. Finally, if there is a specific command-line, R, or Python tool for this task, I’d like to know about it.
You might be interested in the viral sequence data from CATCH. Various viral subsets are available.
You could also try the NCBI virus website, filtering on human hosts. There's a "Download" link on that page that allows you to download the full sequence set.
You've asked a very specific question, so I'm giving you a very specific answer. It seems that what you actually want to do is different from the self-explanatory question.
For the purpose of creating an index for Kallisto or Salmon, it should be okay to use the full dataset. Those programs are designed to handle eukaryotic transcriptomes as input, which typically include many gene copies and isoforms, so transcript / genome duplication shouldn't be much of an issue.
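A minimal sketch of how such a combined reference could be prepared, assuming both the human transcriptome and the viral sequences are available as plain FASTA text. The `VIRAL|` prefix and the function names are made up for illustration; they are not a Kallisto or Salmon convention. Tagging the viral headers just makes them easy to pick out of the quantification output later.

```python
from typing import Iterable, Iterator, List

VIRAL_PREFIX = "VIRAL|"  # hypothetical marker chosen for this example


def tag_fasta_headers(lines: Iterable[str], prefix: str = VIRAL_PREFIX) -> Iterator[str]:
    """Yield FASTA lines, inserting `prefix` right after '>' on header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield ">" + prefix + line[1:]
        elif line:  # skip blank lines, keep sequence lines unchanged
            yield line


def combine_references(human_lines: Iterable[str],
                       viral_lines: Iterable[str],
                       prefix: str = VIRAL_PREFIX) -> List[str]:
    """Return human records unchanged, followed by the tagged viral records."""
    combined = [line.rstrip("\n") for line in human_lines if line.strip()]
    combined.extend(tag_fasta_headers(viral_lines, prefix))
    return combined


if __name__ == "__main__":
    # Toy records with placeholder IDs, not real accessions.
    human = [">human_tx_1\n", "ACGTACGT\n"]
    viral = [">virus_seq_1 example virus\n", "GGCCGGCC\n"]
    print("\n".join(combine_references(human, viral)))
```

The combined lines could then be written to a file and passed to something like `kallisto index -i <index> <combined.fa>`; the file names are up to you.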
Without knowing more about your particular application, it's difficult to provide additional advice. From what you've written, i.e. "what proportion of my raw data is originated from virus RNA", a hit to any viral sequence would be sufficient. If you want more specific information about viral proportion / species, however, you should be using a metagenomic tool like Kraken2, rather than Kallisto / Salmon.
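The "hit to any viral sequence" proportion can be sketched roughly like this, assuming a Kallisto-style `abundance.tsv` with `target_id` and `est_counts` columns. How viral targets are recognised is an assumption: here a caller-supplied predicate decides, and the `VIRAL|` prefix in the demo is purely hypothetical.

```python
import csv
import io
from typing import Callable, Optional


def viral_fraction(abundance_tsv: str,
                   is_viral: Callable[[str], bool],
                   total_reads: Optional[float] = None) -> float:
    """Fraction of estimated counts assigned to viral targets.

    By default the denominator is the sum of est_counts (assigned reads only).
    Pass total_reads to express the result relative to all sequenced reads.
    """
    reader = csv.DictReader(io.StringIO(abundance_tsv), delimiter="\t")
    viral = 0.0
    assigned = 0.0
    for row in reader:
        counts = float(row["est_counts"])
        assigned += counts
        if is_viral(row["target_id"]):
            viral += counts
    denominator = assigned if total_reads is None else total_reads
    return viral / denominator if denominator > 0 else 0.0


if __name__ == "__main__":
    # Toy table with made-up values, only to show the expected layout.
    rows = [
        ["target_id", "length", "eff_length", "est_counts", "tpm"],
        ["human_tx_1", "1000", "800", "90", "1"],
        ["VIRAL|virus_seq_1", "1000", "800", "10", "1"],
    ]
    demo = "\n".join("\t".join(r) for r in rows)
    print(viral_fraction(demo, lambda tid: tid.startswith("VIRAL|")))
```

Note that reads which did not map anywhere do not appear in the abundance table, so if "proportion of raw data" means all sequenced reads, the total read count has to come from elsewhere (for Kallisto, `run_info.json` reports the number of processed reads) and be passed as `total_reads`.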
Answered by gringer on October 6, 2021
I found a possible answer to my own question by reading this Biostar page and also finding an NCBI page for downloading the reference genome file for viruses.
Click "Genome Assembly and Annotation report" on the new page. You will be redirected to a page with some information on the reference files submitted to NCBI. The last column in this table holds links to the NCBI FTP site. There are two options, the RefSeq FTP or the GenBank FTP, and you can choose either of them; I chose RefSeq. Then you will be redirected to a page listing different FASTA files for that specific virus. GCF_002815995.1_ASM281599v1_cds_from_genomic.fna.gz is the transcriptome file annotated with CDS info, which can be used for Kallisto pseudoalignment.
Note: If you can't find your desired virus genome on the NCBI page mentioned above, you can generate the transcriptome file from the genome and annotation files using gffread. The details of this method are provided in the answer to the Biostar question mentioned above.
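For the gffread route, a rough sketch of the call wrapped in Python. It assumes gffread is installed and on your PATH, and that you have a genome FASTA and a matching GFF/GTF annotation; the file names below are placeholders and the helper function is hypothetical. The `-w` option writes the spliced transcript sequences and `-g` points to the genome FASTA.

```python
import shlex
import shutil
import subprocess
from typing import List


def gffread_command(genome_fasta: str, annotation: str, out_fasta: str) -> List[str]:
    """Build a gffread call that extracts transcript sequences from a genome."""
    return ["gffread", "-w", out_fasta, "-g", genome_fasta, annotation]


def run_gffread(genome_fasta: str, annotation: str, out_fasta: str) -> None:
    """Run gffread, failing early with a clear message if it is not installed."""
    if shutil.which("gffread") is None:
        raise RuntimeError("gffread was not found on PATH")
    subprocess.run(gffread_command(genome_fasta, annotation, out_fasta), check=True)


if __name__ == "__main__":
    # Placeholder file names; this only prints the command, it does not run it.
    cmd = gffread_command("virus_genome.fna", "virus_annotation.gff", "virus_transcripts.fa")
    print(shlex.join(cmd))
```

The resulting FASTA can then be used the same way as the `cds_from_genomic` file, e.g. as input for building the Kallisto index.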
Answered by Reza Rezaei on October 6, 2021