Bioinformatics Asked by Paul Endymion on December 25, 2020
I have been working with several tools on bam files recently and I am not sure how I should interpret some of the outputs.
samtools idxstats mapped.sorted.bam
Chr1 15096 0 0
Chr2 33397 0 0
Chr3 43888 41 0
Chr4 20819 1 0
* 0 0 34
Documentation says third and fourth column give “mapped reads” and “unmapped reads”. So I interpreted the last line (*
) as the number of unmapped reads as this is the only line with a non 0 value at this field.
Then I used
bam splitChromosome --in mapped.sorted.bam --out test/aln.
Reference Name: chr6 has 6 records
Reference Name: Chr7 has 1 records
Reference Name: Chr3 has 41 records
Reference Name: Chr4 has 1 records
Reference Name: * has 34 records
And looking at the bams outputted (one per reference sequence) in test/
I find an aln.UnknownChrom.bam
that I am not sure where it comes from. I actually found 97 unique reads ID in it, where I was waiting for 34 or less.
So my first question is : Does “*” actually means “all reads that didn’t mapp” and this information is stored in way I didn’t know about in BAM files ?
And my second question would be : Anybody knows where this aln.UnknownChrom.bam
comes from when using splitChromosome
? It is a source of error in a pipeline I am building and I’d like to get rid of it, if these are not actual mapping informations.
As per the SAM/BAM format specifications:
"RNAME: Reference sequence NAME of the alignment. If @SQheader lines are present, RNAME(if not ‘*’) must be present in one of the SQ-SN tag. An unmapped segment without coordinate has a ‘*’ at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is ‘*’, no assumptions can be made about POS and CIGAR." -- http://samtools.github.io/hts-specs/SAMv1.pdf, emphasis mine
So, yes, these are the unmapped reads.
I haven't used this splitChromosome
utility before myself, but as it splits the BAM file into one BAM file per chromosome, I suspect it simply also creates a BAM file for the unknown (*
) chromosome. Why the numbers don't match I can't tell.
How to deal with that file in your pipeline will depend on what the pipeline is meant to do. But if you aren't interested in these reads, I suppose you could simply remove it before continuing to the next step.
Correct answer by DavyCats on December 25, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP