samtools / bamUtil | Meaning of * as Reference Name

Question

I have been working with several tools on BAM files recently and I am not sure how to interpret some of the outputs.

samtools idxstats mapped.sorted.bam

Chr1    15096   0   0
Chr2    33397   0   0
Chr3    43888   41  0
Chr4    20819   1   0
*   0   0   34

The documentation says the third and fourth columns give the numbers of "mapped reads" and "unmapped reads". So I interpreted the last line (*) as the number of unmapped reads, since it is the only line with a non-zero value in that field.
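For reference, the column layout of idxstats (reference name, sequence length, mapped read-segments, unmapped read-segments) can be checked with a small script. This is only a sketch: it parses the text pasted above rather than calling samtools, and the names `IdxStat` and `parse_idxstats` are made up for illustration.

```python
from typing import List, NamedTuple


class IdxStat(NamedTuple):
    name: str      # reference sequence name, or "*" for reads without a reference
    length: int    # reference sequence length
    mapped: int    # mapped read-segments on this reference
    unmapped: int  # unmapped read-segments assigned to this reference


def parse_idxstats(text: str) -> List[IdxStat]:
    """Parse the four whitespace-separated columns of `samtools idxstats` output."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length, mapped, unmapped = line.split()
        rows.append(IdxStat(name, int(length), int(mapped), int(unmapped)))
    return rows


# The idxstats output from the question, copied verbatim.
EXAMPLE = """\
Chr1\t15096\t0\t0
Chr2\t33397\t0\t0
Chr3\t43888\t41\t0
Chr4\t20819\t1\t0
*\t0\t0\t34
"""

if __name__ == "__main__":
    rows = parse_idxstats(EXAMPLE)
    total_mapped = sum(r.mapped for r in rows)
    # Unmapped reads that still carry a reference name (any line except "*").
    placed_unmapped = sum(r.unmapped for r in rows if r.name != "*")
    # Unmapped reads with no reference at all (the "*" line).
    unplaced_unmapped = sum(r.unmapped for r in rows if r.name == "*")
    print(total_mapped, placed_unmapped, unplaced_unmapped)
```

On the output above, all unmapped reads sit on the * line, and none are attached to a named reference.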

Then I used

bam splitChromosome --in mapped.sorted.bam --out test/aln.

Reference Name: chr6 has 6 records
Reference Name: Chr7 has 1 records
Reference Name: Chr3 has 41 records
Reference Name: Chr4 has 1 records
Reference Name: * has 34 records

And looking at the BAM files it wrote (one per reference sequence) in test/, I find an aln.UnknownChrom.bam whose origin I am not sure of. I actually found 97 unique read IDs in it, where I was expecting 34 or fewer.

So my first question is: does "*" actually mean "all reads that didn't map", and is this information stored in BAM files in a way I didn't know about?

And my second question: does anybody know where this aln.UnknownChrom.bam comes from when using splitChromosome? It is a source of errors in a pipeline I am building, and I'd like to get rid of it if it does not contain actual mapping information.

DavyCats · Accepted Answer

As per the SAM/BAM format specifications:

"RNAME: Reference sequence NAME of the alignment. If @SQ header lines are present, RNAME (if not ‘*’) must be present in one of the SQ-SN tag. An unmapped segment without coordinate has a ‘*’ at this field. However, an unmapped segment may also have an ordinary coordinate such that it can be placed at a desired position after sorting. If RNAME is ‘*’, no assumptions can be made about POS and CIGAR." -- http://samtools.github.io/hts-specs/SAMv1.pdf, emphasis mine

So, yes, these are the unmapped reads.
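To make the distinction in the spec concrete, here is a minimal sketch that sorts SAM records into three groups using the unmapped flag bit (0x4) and the RNAME field. The three SAM lines are invented for illustration and do not come from your file, and `classify` is a hypothetical helper, not part of samtools or bamUtil.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: "segment unmapped"


def classify(sam_line: str) -> str:
    """Classify one SAM alignment line by its FLAG (column 2) and RNAME (column 3)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]
    if not flag & UNMAPPED_FLAG:
        return "mapped"
    # Unmapped: either no reference at all, or placed on a reference
    # (typically next to a mapped mate) so that it sorts with it.
    return "unmapped, no reference" if rname == "*" else "unmapped, placed on reference"


# Invented example records (QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL).
RECORDS = [
    "r1\t0\tChr3\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",   # ordinary mapped read
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",          # unmapped, RNAME is "*"
    "r3\t69\tChr3\t100\t0\t*\t=\t100\t0\tACGT\tIIII",  # unmapped, given its mate's position
]

if __name__ == "__main__":
    for record in RECORDS:
        print(record.split("\t")[0], "->", classify(record))
```

Reads of the second kind are the ones idxstats reports on the * line; reads of the third kind would show up in the fourth column of a named reference instead.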

I haven't used this splitChromosome utility before myself, but as it splits the BAM file into one BAM file per chromosome, I suspect it simply also creates a BAM file for the unknown (*) chromosome. Why the numbers don't match I can't tell.

How to deal with that file in your pipeline will depend on what the pipeline is meant to do. But if you aren't interested in these reads, I suppose you could simply remove it before continuing to the next step.
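If skipping the file is enough, something along these lines might do. It is only a sketch: the `aln.` prefix and the name aln.UnknownChrom.bam come from your post, the function name `per_chromosome_bams` is made up, and it does nothing to explain the 97 vs 34 discrepancy. It leaves the file on disk and just hides it from the next step, which seems safer than deleting it.

```python
from pathlib import Path
from typing import Iterable, List


def per_chromosome_bams(
    out_dir: str,
    prefix: str = "aln.",
    skip: Iterable[str] = ("aln.UnknownChrom.bam",),
) -> List[Path]:
    """List the split BAM files in out_dir, leaving out the names given in skip."""
    skipped = set(skip)
    return sorted(
        path
        for path in Path(out_dir).glob(f"{prefix}*.bam")
        if path.name not in skipped
    )


if __name__ == "__main__":
    # "test" is the output directory used in the question.
    for bam in per_chromosome_bams("test"):
        print(bam)
```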
