How to get transcriptome FASTA file for viruses for Kallisto pseudo-alignment?

Question

I guess the title is self-explanatory. I'd like to find the reference transcriptome files for a few human viruses. My RNA-seq samples are from human tumor cells that were infected with viruses. The main part of my analysis is quantifying human gene expression from these samples. However, I also would like to find what proportion of my raw data is originated from virus RNA.
So, I'd like to learn how to find the reference transcriptome data for human viruses. Moreover, I would appreciate it if you could give me some insight into your workflow for approaching this problem. Finally, if there is any command-line, R, or Python tool for this specific task, I'd like to know about it.

gringer · Answer

You might be interested in the viral sequence data from CATCH. Various viral subsets are available.
You could also try the NCBI virus website, filtering on human hosts. There's a "Download" link on this page that allows you to download the full sequence set:
https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?VirusLineage_ss=Viruses,%20taxid:10239&HostLineage_ss=humans,%20taxid:9605&SeqType_s=Nucleotide
You've asked a very specific question, so I'm giving you a very specific answer. It seems that what you actually want to do is different from the self-explanatory question.
For the purpose of creating an index for Kallisto or Salmon, it should be okay to use the full dataset. Those programs are designed to handle eukaryotic transcriptomes as input, which typically include many gene copies and isoforms, so transcript / genome duplication shouldn't be much of an issue.
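As a rough sketch of how a combined reference could be prepared before indexing, the snippet below appends a viral FASTA to a human transcriptome FASTA and tags each viral header so that viral targets are easy to recognise later. The function names, the file names, and the `VIRUS|` prefix are assumptions made for illustration; they are not part of Kallisto or Salmon.

```python
import gzip
from pathlib import Path


def open_text(path, mode="rt"):
    """Open a plain or gzip-compressed text file, chosen by file extension."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode)
    return open(path, mode)


def merge_fastas(human_fasta, viral_fasta, out_fasta, viral_prefix="VIRUS|"):
    """Write human records unchanged, then viral records with tagged headers.

    Returns a tuple (n_human, n_viral) with the number of records written.
    The prefix is a hypothetical naming convention, not a tool requirement.
    """
    counts = [0, 0]
    with open_text(out_fasta, "wt") as out:
        for i, (src, prefix) in enumerate([(human_fasta, ""), (viral_fasta, viral_prefix)]):
            with open_text(src) as fh:
                for line in fh:
                    if line.startswith(">"):
                        counts[i] += 1
                        line = ">" + prefix + line[1:]
                    # Guard against a missing final newline in the input file.
                    if not line.endswith("\n"):
                        line += "\n"
                    out.write(line)
    return tuple(counts)


# Hypothetical usage; the index is then built with the kallisto CLI:
#   merge_fastas("human_transcripts.fa.gz", "human_viruses.fa.gz", "combined.fa")
#   kallisto index -i combined.idx combined.fa
```

Tagging the headers up front means no separate list of viral accession numbers has to be maintained when the quantification output is inspected.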
Without knowing more about your particular application, it's difficult to provide additional advice. From what you've written, i.e. "what proportion of my raw data is originated from virus RNA", a hit to any viral sequence would be sufficient. However, if you want more specific information about viral proportion / species, then you should use a metagenomic tool like Kraken2 rather than Kallisto / Salmon.
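For the "any viral hit" case, a minimal sketch of the calculation is to sum the `est_counts` column of Kallisto's `abundance.tsv` over viral targets. It assumes that viral target IDs can be recognised by a prefix (a hypothetical `VIRUS|` tag added to the FASTA headers before indexing); the function name and the prefix are illustrative only.

```python
import csv


def viral_fraction(abundance_tsv, viral_prefix="VIRUS|", n_processed=None):
    """Return the share of reads that Kallisto assigned to viral targets.

    By default the denominator is the sum of est_counts, i.e. pseudoaligned
    reads only. Pass n_processed (the value reported in Kallisto's
    run_info.json) to get the share of all processed reads instead.
    """
    viral = total = 0.0
    with open(abundance_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            counts = float(row["est_counts"])
            total += counts
            # Assumed convention: viral targets carry a recognisable prefix.
            if row["target_id"].startswith(viral_prefix):
                viral += counts
    denominator = n_processed if n_processed is not None else total
    return viral / denominator if denominator else 0.0


# Hypothetical usage:
#   viral_fraction("kallisto_out/abundance.tsv")
```

The two denominators answer different questions: the default gives the viral share among reads that pseudoaligned to anything in the index, whereas `n_processed` is closer to "proportion of my raw data".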

Reza Rezaei · Answer

I found a possible answer to my own question by reading this Biostars page and by finding an NCBI page for downloading reference genome files for viruses.

Go to this NCBI page and search for your target virus.
Then click on the virus name. You will be redirected to a page like this, which shows an Entrez records table. Click on the direct link in the Genome row of the table.
Click on Genome Assembly and Annotation report on the new page. You will be redirected to a page like this with some information on the reference files submitted to NCBI. The last column of this table contains links to the NCBI FTP site, with a choice between the RefSeq and GenBank FTPs. You can choose either of them; I chose RefSeq. You will then be redirected to a page like this with different FASTA files for that specific virus.
The file named GCF_002815995.1_ASM281599v1_cds_from_genomic.fna.gz contains the annotated coding sequences (CDS) and can be used as the transcriptome file for Kallisto pseudo-alignment.

Note: If you cannot find your desired virus genome on the NCBI page mentioned above, you can generate the transcriptome file from the genome and annotation files using gffread. The details of this method are provided in the answer to the Biostars question mentioned above.
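A minimal sketch of the gffread route, wrapped in Python, might look like the following. It shows one common way to call gffread (`-w` writes the spliced transcript sequences, `-g` names the genome FASTA) and is not necessarily the exact command from the Biostars answer; the function names and file names are hypothetical.

```python
import shutil
import subprocess


def gffread_command(genome_fasta, annotation_gff, out_fasta):
    """Build the gffread argument list that writes transcript sequences."""
    # -w: output FASTA with the spliced exons of each transcript
    # -g: genome FASTA that the annotation coordinates refer to
    return ["gffread", "-w", str(out_fasta), "-g", str(genome_fasta), str(annotation_gff)]


def extract_transcripts(genome_fasta, annotation_gff, out_fasta):
    """Run gffread if it is installed; raise a clear error otherwise."""
    if shutil.which("gffread") is None:
        raise RuntimeError("gffread was not found on PATH")
    subprocess.run(gffread_command(genome_fasta, annotation_gff, out_fasta), check=True)
    return out_fasta


# Hypothetical usage with uncompressed input files:
#   extract_transcripts("virus_genomic.fna", "virus_genomic.gff", "virus_transcripts.fa")
```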
