Bioinformatics Asked by Emma Athan on April 15, 2021
I was given Illumina FASTQ files and I do not know whether they contain DNA-seq or RNA-seq data. How can I check this? I have no report and no one to ask.
Many thanks
There are a few quick-and-dirty ways, depending on the type of data. In any case, you want to align your files to a reference genome and then check the distribution of reads, either in a genome browser or with a tool such as RSeQC, which calculates the fraction of reads aligning to exons, introns, intergenic regions, etc.
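To make the exon/intron/intergenic idea concrete, here is a minimal pure-Python sketch. It is not how RSeQC works internally; it just assigns each read to a category by its midpoint. The function names, the single-chromosome simplification, and the half-open `(start, end)` tuples are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def build_index(intervals):
    """Sort non-overlapping half-open (start, end) intervals for fast lookup."""
    ivs = sorted(intervals)
    starts = [s for s, _ in ivs]
    return starts, ivs


def contains(index, pos):
    """Return True if pos falls inside any interval of the index."""
    starts, ivs = index
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos < ivs[i][1]


def classify_reads(reads, exons, genes):
    """Fraction of reads whose midpoint lies in an exon, an intron, or outside genes.

    Assumes one chromosome and non-overlapping genes (a simplification).
    """
    exon_idx = build_index(exons)
    gene_idx = build_index(genes)
    counts = Counter()
    for start, end in reads:
        mid = (start + end) // 2
        if contains(exon_idx, mid):
            counts["exon"] += 1
        elif contains(gene_idx, mid):
            counts["intron"] += 1  # inside a gene but not in an exon
        else:
            counts["intergenic"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: counts[k] / total for k in ("exon", "intron", "intergenic")}
```

A strongly exon-dominated result points towards RNA-seq or exome-seq, while whole-genome DNA-seq should roughly follow the share of the genome each category occupies.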
RNA-seq: if you use a standard aligner such as bowtie2 or bwa, and you have a higher species that splices its RNA, you should see most reads aligning to exons, and a sizeable fraction of the reads that span a splice junction going unmapped or being only partially aligned (soft-clipped). The exon-intron-exon gap is large, and tools that are not splice-aware cannot bridge it without a tremendous drop in mapping quality.
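One rough way to quantify this is to summarise the SAM output of the aligner. The sketch below reads plain SAM text lines and reports the fraction of unmapped primary reads (flag bit `0x4`) and, in case a splice-aware aligner was used instead, the fraction of reads with an `N` (skipped region) in the CIGAR string. The function name `sam_summary` and the returned dictionary keys are made up for this example.

```python
def sam_summary(lines):
    """Summarise primary alignments from an iterable of SAM text lines."""
    total = unmapped = spliced = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        cigar = fields[5]
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        total += 1
        if flag & 0x4:
            unmapped += 1
        elif "N" in cigar:
            spliced += 1  # alignment skips a reference region, e.g. an intron
    if total == 0:
        return {"total": 0, "unmapped_fraction": None, "spliced_fraction": None}
    return {
        "total": total,
        "unmapped_fraction": unmapped / total,
        "spliced_fraction": spliced / total,
    }
```

With bowtie2 or bwa you would not expect any `N` operations, so the unmapped fraction is the number to look at. A clearly non-zero spliced fraction from a splice-aware aligner would be a fairly direct hint at RNA-seq.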
DNA-seq (if whole-genome) should have somewhat even coverage across the genome on a global scale. Targeted assays such as ATAC-seq should give distinct peaks that you can easily see in a genome browser just by eye.
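If you prefer a number over eyeballing, one possible measure of "evenness" is the coefficient of variation of read counts in fixed-size bins. This is only a sketch under simple assumptions: a single chromosome, 0-based read start positions smaller than `genome_length`, and helper names invented for this example.

```python
from statistics import mean, pstdev


def binned_counts(read_starts, genome_length, bin_size):
    """Count read start positions in consecutive bins of bin_size bases."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def coverage_cv(counts):
    """Coefficient of variation of bin counts: low means even, high means peaky."""
    m = mean(counts)
    return pstdev(counts) / m if m > 0 else float("nan")
```

A value near zero suggests even, whole-genome-like coverage, while a large value suggests that reads pile up in a few regions (peaks, exons, or other targets). What counts as "large" depends on depth and bin size, so compare against a dataset of known type if you can.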
If it is exome-seq (targeted DNA-seq via exon pulldown), it might get tricky, because most reads should also align to exons, as in RNA-seq. However, you should not see spliced reads, i.e. reads whose 5' part starts in exon 1 and whose 3' part ends in exon 2. Instead, many reads should cover the exon-intron boundary.
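The boundary argument can be sketched as a simple count: among contiguously aligned reads that overlap an exon, how many extend across an exon edge into the neighbouring sequence? The function below is an illustrative assumption, not an established tool. It treats every exon edge as a boundary (including the outer edges of a gene), uses half-open `(start, end)` coordinates on one chromosome, and scans exons naively, so it is only meant for small test regions.

```python
def boundary_crossing_fraction(reads, exons, min_overhang=5):
    """Fraction of exon-overlapping reads that cross an exon edge.

    A read counts as crossing when at least min_overhang bases lie on
    each side of the edge.
    """
    boundaries = sorted({b for s, e in exons for b in (s, e)})
    touching = crossing = 0
    for r_start, r_end in reads:
        # Skip reads that do not overlap any exon at all
        if not any(r_start < e and s < r_end for s, e in exons):
            continue
        touching += 1
        if any(r_start + min_overhang <= b <= r_end - min_overhang
               for b in boundaries):
            crossing += 1
    return crossing / touching if touching else float("nan")
```

In exome-seq a noticeable share of exon-overlapping reads should cross into introns, whereas in RNA-seq from mature mRNA this fraction should be small.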
As said, I would simply align the data and check it in a browser; I imagine one can make an educated guess just by looking at the alignments by eye.
Reliability is a different matter. I personally would not touch a dataset of unknown origin, because it also means you have no idea which protocols were used for library prep or what the sample actually is. So I wonder which question your analysis is going to answer.
Answered by ATpoint on April 15, 2021
Honestly, ask who made them. If the reads are from an organism without splicing, it will not be trivial to figure it out with 100% certainty.
If you are calling variants, I don't think it matters much. That said, higher organisms have such a small percentage of their genome transcribed that it should be obvious very quickly whether you have the desired coverage across the entire genome or just across the transcripts.
Answered by swbarnes2 on April 15, 2021