Add tags to the read pairs in a bam file

Question

I have a bam file from paired-end sequencing, in which each R1 has the RX tag assigned (via UMItools group) denoting a Unique Molecular Identifier (UMI).
I want to call molecular consensus reads using fgbio but fgbio GroupReadsByUmi gives me an error, because it expects both R1 and R2 to have the RX tag set.
I would like a tool that will go through my bam file and assign each R2 the same RX tag as the R1.

Maximilian Press · Answer

The best thing would be to figure out how to do this in UMItools. I don't know that tool at all, but it seems that it is supposed to be able to deal with / tag both reads in paired data. e.g. did you use --paired flag, etc.
I would suggest investigating that route first before you look at the next part of my answer.

Per Devon Ryan's comment suggestion I would suggest a workflow similar to this:

sort bam file by query name (samtools sort -n) to get the R1s immediately following the R2s
Write a short script/command that

reads through the bam file and for each line
looks up the tag in the R1, prints R1 line
reads the R2, substitutes the RX tag in, prints R2 line

pipe that through samtools view and possibly samtools sort back to a bam file of the same type that you started with.

My pysam is a little rusty but something like this might work
import pysam

bam = pysam.AlignmentFile("bam", "rb")
# print header
print(bam.header)

tag = None
for read in bam:
    r2 = False
    if tag:
        read.assign_umi_tag() # assign to read- not sure where/how
        r2 = True
    else:
        tag = read.lookup_umi_tag()  # i don't know where this tag is in bam format
    print(read)
    if r2:
        tag = None

I haven't tested that in any way, caveat emptor.
I am not personally familiar with UMItools and how it assigns tags so can't suggest anything more specific, sorry.

Add tags to the read pairs in a bam file

One Answer

Add your own answers!

Ask a Question