Why does the FASTA sequence for coronavirus look like DNA, not RNA?

Question

I'm looking at a genome sequence for 2019-nCoV on NCBI. The FASTA sequence looks like this:

>MN988713.1 Wuhan seafood market pneumonia virus isolate 2019-nCoV/USA-IL1/2020, complete genome
ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAA
CGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAAC
TAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTG
...  
...
TTAATCAGTGTGTAACATTAGGGAGGACTTGAAAGAGCCACCACATTTTCACCGAGGCCACGCGGAGTAC
GATCGAGTGTACAGTGAACAATGCTAGGGAGAGCTGCCTATATGGAAGAGCCCTAATGTGTAAAATTAAT
TTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAA

Coronavirus is an RNA virus, so I was expecting the sequence to consist of AUGC characters. But the letters here are ATGC, which looks like DNA!

I found a possible answer, that this is the sequence of a "complementary DNA". I read that

The term cDNA is also used, typically in a bioinformatics context, to refer to an mRNA transcript's sequence, expressed as DNA bases (GCAT) rather than RNA bases (GCAU).

However, I don't believe this theory that I'm looking at a cDNA. If this were true, the end of the true mRNA sequence would be ...UCUUACUGUUUUUUUUUUUU, or a "poly(U)" tail. But I believe the coronavirus has a poly(A) tail.

I also found that all of the highlighted genes begin with the sequence ATG. This is the DNA equivalent of the RNA start codon AUG.

So, I believe what I'm looking at is the true mRNA, in 5'→3' direction, but with all U converted to T.

So, is this really what I'm looking at? Is this some formatting/representation issue? Or does 2019-nCoV really contain DNA, rather than RNA?

gringer · Answer

It's not common to sequence directly from RNA because most sequencing platforms don't have that as an option. Nanopore sequencers do allow this, but I'm not yet aware of any 2019-nCoV preprints involving nanopore RNA sequencing. I expect that will change in the next month or so.

Commercial kits exist; there are no insurmountable technical issues with it. Direct RNA sequencing can be done locally, on-site near the point of discovery, without sample transfer or culture, on a USB-powered device that fits in a pocket (RNA preparation takes about 2 hours). Flow cells that have potentially infectious RNA inside them can be disposed of as biohazard waste. However, the ease with which RNA can be quickly converted to more stable cDNA and then amplified to create a much higher-concentration DNA sample (which is quicker and more efficient to get results from) means that cDNA is generally preferred for sequencing unless the native RNA is needed (e.g. for looking at RNA base modifications that are destroyed when converting to cDNA).

There is a paper on coronavirus direct RNA sequencing with nanopore here; I would expect that 2019-nCoV would have a similar difficulty. Zika virus has an extremely low viral load in human blood, but has also been sequenced via direct RNA sequencing of [carefully] cultured cells (see here).

Regardless of whether or not RNA sequencing has actually been carried out, most genetic data analysis programs will only work with A/C/G/T sequences, so it's conventional to replace every U in an RNA sequence with T for data storage. There's no loss of information in doing this: every U becomes a T, and since the RNA sequence contains no T of its own, the substitution is reversible.
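To illustrate the U/T convention, here is a minimal Python sketch. The function names are my own, not from any library, and the input is just the first few bases of the FASTA record quoted in the question. It assumes a plain A/C/G/U or A/C/G/T sequence with no ambiguity codes.

```python
def rna_to_dna_alphabet(rna: str) -> str:
    """Write an RNA sequence in the DNA alphabet (U -> T), as databases store it."""
    return rna.upper().replace("U", "T")


def dna_alphabet_to_rna(dna: str) -> str:
    """Recover the RNA spelling (T -> U) from a sequence stored with T."""
    return dna.upper().replace("T", "U")


# First bases of the record in the question, as stored (with T).
as_stored = "ATTAAAGGTTTATACCTTCCCAGGTAACAAACC"

# The same bases spelled as RNA.
as_rna = dna_alphabet_to_rna(as_stored)
print(as_rna)

# Round trip: converting back gives exactly the stored sequence.
assert rna_to_dna_alphabet(as_rna) == as_stored
```

The round trip only works because a genuine RNA sequence has no T to begin with; a string mixing U and T could not be recovered this way.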

Konrad Rudolph · Answer

If this were [cDNA], the end of the true mRNA sequence would be ...UCUUACUGUUUUUUUUUUUU, or a "poly(U)" tail.

A cDNA sequence, maybe confusingly, refers to the coding strand of the cDNA (despite being called “complementary”). So while cDNA is the result of reverse transcribing RNA into DNA, by convention it has the same strandedness as the original RNA. That’s why what you’re seeing is read in 5′→3′ direction and contains a visible poly(A) tail. Having a single conventional reading direction for all archived sequences vastly simplifies data handling, and reduces errors.
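As a quick check of the visible poly(A) tail, the following Python sketch counts the trailing A bases in the last line of the FASTA record from the question. The helper name `poly_a_tail_length` is hypothetical; it simply measures the run of A at the 3′ end of whatever string it is given.

```python
def poly_a_tail_length(seq: str) -> int:
    """Return the number of consecutive A bases at the 3' end of seq."""
    seq = seq.upper()
    # rstrip("A") removes only the trailing run of A characters.
    return len(seq) - len(seq.rstrip("A"))


# Last line of the FASTA record quoted in the question.
last_line = "TTTAGTAGTGCTATCCCCATGTGATTTTAATAGCTTCTTAGGAGAATGACAAAAAAAAAAAA"

tail = poly_a_tail_length(last_line)
print(f"trailing A bases: {tail}")
```

A non-zero result here is consistent with the stored sequence being written in the same 5′→3′ orientation as the viral RNA, rather than as its complement.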

In fact, since cDNA is double-stranded, there is no a priori reason why a computer-stored cDNA sequence should refer to the template strand (i.e. the opposite strand, which is synthesised from the RNA during reverse transcription).

The whole (simplified) synthesis process of cDNA is as follows:

1. A primer hybridises to the template RNA molecule.
2. The RNA template is reverse transcribed into DNA using reverse transcriptase.
3. The RNA template is removed.
4. A complementary DNA strand is synthesised along the (currently) single-stranded cDNA, resulting in a double-stranded cDNA product.
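The synthesis steps above can be mimicked on strings. This is only a sketch: the function names are invented for illustration, the input is a shortened fragment modelled on the 3′ end of the genome in the question, and real reverse transcription involves primers and enzymes that are not modelled here. It shows that the second cDNA strand has the same sequence as the original RNA, with T in place of U.

```python
# Base-pairing tables: RNA template -> DNA strand, and DNA -> DNA.
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def first_strand_cdna(rna: str) -> str:
    """DNA strand made by reverse transcription of rna, written 5'->3'."""
    # Complement each base, then reverse so the result reads 5'->3'.
    return rna.translate(RNA_TO_DNA_COMPLEMENT)[::-1]


def second_strand_cdna(first_strand: str) -> str:
    """DNA strand synthesised along the first strand, written 5'->3'."""
    return first_strand.translate(DNA_COMPLEMENT)[::-1]


# Illustrative RNA fragment ending in a short poly(A) tail.
rna_tail = "GGAGAAUGACAAAAAAAA"

first = first_strand_cdna(rna_tail)   # template strand, starts with poly(T)
second = second_strand_cdna(first)    # coding strand, ends with poly(A)

print(first)
print(second)

# The coding strand is the RNA sequence with U written as T.
assert second == rna_tail.replace("U", "T")
```

The first strand begins with a run of T, which is the strand the question had in mind; the archived sequence corresponds to the second strand.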

M__ · Answer

That is the correct sequence for 2019-nCoV. Coronavirus is of course an RNA virus and in fact, to my knowledge, every RNA virus in GenBank is present as cDNA (AGCT, i.e. thymine) and not RNA (AGCU, i.e. uracil).

The reason is simple: we almost never sequence directly from RNA, because RNA is unstable and easily degraded by RNase. Instead the genome is reverse transcribed, either by targeted reverse transcription or random amplification, and thus converted to cDNA. cDNA is stable and is essentially reverse-transcribed RNA.

The cDNA is either sequenced directly or further amplified by PCR and then sequenced. Hence the sequence we observe is the cDNA rather than RNA, thus we observe thymine rather than uracil and that is how it is reported.

ATpoint · Answer

Most sequencing experiments, be they Illumina-based next-generation sequencing or Sanger sequencing, use DNA as the template, not RNA. Even though this virus is RNA-based, it would be reverse-transcribed prior to any sequencing experiment. Therefore the output is DNA, and this is what NCBI provides here.

Answered by ATpoint on November 12, 2021
