Fastq: how can I check if they are from DNA or RNAseq data?

Question

I was given Illumina fastq files and I do not know whether they are DNA-seq or RNA-seq data. How can I check this? I have no report and nobody to ask.
Many thanks

ATpoint · Answer

There are a few quick'n'dirty ways depending on the type of data.
In any case, you want to align your files to a reference genome and then check the distribution of reads, either in a genome browser or with a tool such as RSeQC, which calculates the fraction of reads aligning to exons, introns, intergenic regions, etc.
RNA-seq: if you use a standard, non-splice-aware aligner such as bowtie2 or bwa, and the sample comes from a higher species that splices its RNA, you should see most reads aligning to exons, and a sizeable fraction of the reads that span a splice junction ending up unmapped (the exon-intron-exon gap is large, and tools that are not splice-aware cannot "bridge" it without a tremendous drop in mapping quality, so the read goes unmapped).
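To put rough numbers on this after alignment, something like the sketch below could tally a SAM file. It only relies on the SAM format itself (flag bit 0x4 for unmapped, CIGAR operations `S` for soft clipping and `N` for a skipped region such as an intron). The function name `tally_sam` and the demo records are made up for illustration. Note two assumptions: local aligners such as bwa mem may soft-clip a junction-spanning read rather than leave it unmapped, and `N` operations will only show up if you used a splice-aware aligner.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by an operation code (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def tally_sam(lines):
    """Count unmapped, soft-clipped and spliced primary alignments in SAM text."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a complete alignment record
            continue
        flag = int(fields[1])
        if flag & 0x900:                  # skip secondary (0x100) and supplementary (0x800)
            continue
        counts["total"] += 1
        if flag & 0x4:                    # read is unmapped
            counts["unmapped"] += 1
            continue
        ops = {op for _, op in CIGAR_OP.findall(fields[5])}
        if "N" in ops:                    # skipped reference region, e.g. an intron
            counts["spliced"] += 1
        if "S" in ops:                    # part of the read was soft-clipped
            counts["soft_clipped"] += 1
    return counts

# Hypothetical records, just to show the input shape (SEQ/QUAL left as "*").
demo = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t200\t60\t30M20S\t*\t0\t0\t*\t*",
    "r4\t0\tchr1\t300\t60\t25M500N25M\t*\t0\t0\t*\t*",
    "r5\t256\tchr1\t400\t0\t50M\t*\t0\t0\t*\t*",
]
demo_counts = tally_sam(demo)
print(dict(demo_counts))
```

On real data you would feed it the text output of your aligner (or of `samtools view -h`). A high unmapped or soft-clipped fraction with a non-splice-aware aligner is only a hint towards RNA-seq, not proof.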
DNA-seq: whole-genome data should show fairly even coverage across the genome at the global scale. Targeted assays such as ATAC-seq should give distinct peaks that are easy to spot by eye in a genome browser.
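If you prefer a number over eyeballing the browser, "even coverage" versus "distinct peaks" can be approximated by binning read start positions and looking at how much the bin counts vary. This is only a minimal sketch: the helper names, the bin size and the toy positions are assumptions, and no particular cutoff is implied.

```python
from statistics import mean, pstdev

def bin_counts(starts, genome_length, bin_size):
    """Count read start positions (0-based) per fixed-size bin."""
    n_bins = -(-genome_length // bin_size)   # ceiling division
    bins = [0] * n_bins
    for pos in starts:
        if 0 <= pos < genome_length:
            bins[pos // bin_size] += 1
    return bins

def evenness(bins):
    """Return (coefficient of variation, fraction of empty bins)."""
    m = mean(bins)
    cv = pstdev(bins) / m if m else float("inf")
    empty = sum(1 for b in bins if b == 0) / len(bins)
    return cv, empty

# Toy data: 100 reads spread evenly vs. 100 reads piled into one region.
even_starts = list(range(0, 1000, 10))
peaky_starts = [200 + (i % 100) for i in range(100)]

even_stats = evenness(bin_counts(even_starts, 1000, 100))
peaky_stats = evenness(bin_counts(peaky_starts, 1000, 100))
print(even_stats, peaky_stats)
```

A low coefficient of variation with few empty bins would fit whole-genome DNA-seq; a high one with many empty bins fits targeted or transcript-derived data, although it cannot tell those apart by itself.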
Exome-seq (targeted DNA-seq via exon pulldown) is trickier because, as in RNA-seq, most reads should align to exons. However, you should not see spliced reads, i.e. reads whose 5' part starts in exon 1 and whose 3' part ends in exon 2. Instead, many reads should extend across the exon-intron boundary.
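The exon-intron boundary check could be sketched roughly as follows, assuming you already have contiguous alignment intervals and an exon annotation for the same chromosome as half-open `(start, end)` pairs. The function name `classify` and the coordinates are hypothetical. For real data you would take the intervals from your BAM and a BED/GTF file.

```python
from collections import Counter

def classify(read_start, read_end, exons):
    """Classify one contiguous alignment against exons (half-open intervals).

    'exonic'   : read lies entirely inside a single exon
    'boundary' : read overlaps an exon but extends past its edge
    'outside'  : read overlaps no exon (intronic or intergenic)
    """
    for ex_start, ex_end in exons:
        if ex_start <= read_start and read_end <= ex_end:
            return "exonic"
    for ex_start, ex_end in exons:
        if read_start < ex_end and ex_start < read_end:
            return "boundary"
    return "outside"

# Hypothetical exons and reads on one chromosome.
exons = [(100, 200), (500, 600)]
reads = [(120, 170), (180, 230), (300, 350), (480, 530)]

labels = Counter(classify(s, e, exons) for s, e in reads)
print(dict(labels))
```

Many "boundary" reads aligned contiguously into the flanking intron would point towards exome capture; in RNA-seq of mature mRNA you would expect far fewer of them.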
As said, I would simply align the reads and check in a browser; I imagine one can make an educated guess just by looking at the alignments by eye.
Reliability is a separate issue. I personally would not touch a dataset of unknown origin, because that also means you have no idea which protocols were used for library prep or what the sample actually is, so I wonder which question your analysis is going to answer.

swbarnes2 · Answer

Honestly, ask whoever made them. If the reads are from an organism without splicing, it will not be trivial to figure it out with 100% certainty.
If you are calling variants, I don't think it matters much. Still, because higher organisms transcribe such a small percentage of their genome, it should become obvious very quickly whether you have coverage across the entire genome or just across the transcripts.
