Bioinformatics Asked by fullmooninu on November 5, 2020
I have a FASTA assembly, and by combining it with the raw reads we produced a .bam file, which I converted to .sam.
The alignment lines in the .sam look like this:
A00321:42:HLLVYDSXX:2:2302:6153:3505 99 NODE_1_length_3415511_cov_137.721502 16 60 128M = 607 742 CGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTCCATCATAGCACCATTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF RX:Z:TTGAGGGTATAGTAGT QX:Z:FFFFFFFFFFFFFFFF TR:Z:GACACCG TQ:Z:FFFFFFF BC:Z:AGTTGCAG QT:Z:FFFFFFFF XS:i:-10 AS:i:0 XM:Z:0 AM:Z:0 XT:i:1 RG:Z:over_1kb:LibraryNotSpecified:1:unknown_fc:0 OM:i:60
Split into its mandatory fields, a record looks something like this:
QNAME: A00321:42:HLLVYDSXX:1:1644:2248:3881
FLAG: 99
RNAME: NODE_1_length_3415511_cov_137.721502
POS: 1
MAPQ: 60
CIGAR: 1S127M
RNEXT: =
PNEXT: 536
TLEN: 386
SEQ: ATCGGGTCTGACACCGCGATTAGTCCGGCCAAATCGCCGTCGAGCGCAATGAACATAACGGTCTTGCCCTCAGCGCGCAGCGCATCGGCCTTGGCGTCGATTGTGGAGTGCTCGACGCCCATGATGTC
QUAL: FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
I’m actually interested in the metadata: I want to know how the RX: and BC: fields are distributed across the scaffolds in the original assembly.
I assumed the .sam file already contains the information about the assembly used to produce it. If I’m wrong, please correct me.
What I want to do is, for each read in the .sam file, find its position in the assembled scaffold and record: Read_ID, Scaffold_ID, Read_Position_Inside_Scaffold, RX, BC
Then I want to use that database to analyse the distribution of RX and BC inside each scaffold.
That’s what I want.
Ultimately what I’m trying to do is evaluate the quality of my assemblies based on the Barcode distribution.
I’m good at programming and parsing; I’m just having trouble figuring out where, inside the .sam file, I can find the scaffold and the scaffold position of each read.
"What I want to do is, for each read in the .sam file, I find out its position in the assembled scaffold, and I record, Read_ID,Scaffold_ID,Read_Position_Inside_Scaffold,RX,BC"
"I'm just having trouble figuring out, where, inside the .sam file, can I find the scaffold and scaffold position of each read."
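You already listed all of those:
QNAME (the read ID)
RNAME (the scaffold the read is aligned to)
POS (the 1-based leftmost mapping position on that scaffold)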
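As a rough illustration of pulling those three fields plus the RX and BC tags out of each record, here is a minimal Python sketch. It assumes the .sam file is tab-delimited (as the SAM format specifies; the spaces in the example above are most likely a copy-paste artifact), that unmapped reads should be skipped, and that the optional tags follow the 11 mandatory columns as TAG:TYPE:VALUE. The function names `parse_sam_line` and `sam_to_csv` are made up for this example.

```python
import csv


def parse_sam_line(line):
    """Return (read_id, scaffold_id, pos, rx, bc) for one alignment line.

    Returns None for header lines, truncated lines and unmapped reads.
    """
    if line.startswith("@"):  # header lines (@HD, @SQ, @RG, ...)
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:  # a SAM record has 11 mandatory columns
        return None
    qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
    if flag & 0x4 or rname == "*":  # 0x4 = read is unmapped
        return None
    # Optional tags look like TAG:TYPE:VALUE; the value itself may contain
    # colons (see the RG tag in the question), so split at most twice.
    tags = {}
    for opt in fields[11:]:
        tag, _type, value = opt.split(":", 2)
        tags[tag] = value
    return qname, rname, pos, tags.get("RX"), tags.get("BC")


def sam_to_csv(sam_lines, out):
    """Write one CSV row per mapped read to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(
        ["Read_ID", "Scaffold_ID", "Read_Position_Inside_Scaffold", "RX", "BC"]
    )
    for line in sam_lines:
        record = parse_sam_line(line)
        if record is not None:
            writer.writerow(record)
```

Typical use would be something like `with open("reads.sam") as sam, open("reads.csv", "w", newline="") as out: sam_to_csv(sam, out)`, where both file names are placeholders. Reads without an RX or BC tag get an empty cell in that column.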
Assembly evaluation is a wide topic, but other software you can use includes BUSCO, QUAST or something like LTRi. Here's a more in-depth guide by SciLifeLab in Sweden. Usually, when aligning reads back to an assembly, you care about how the mapped insert size distribution compares to the theoretical insert size distribution, which tells you about collapsed or expanded regions (i.e. structural misassemblies). This is something that tools like FRC_Align do.
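To get a first look at the mapped insert sizes straight from the .sam file, a sketch like the following could work. It reads TLEN (column 9) and, as an assumption of this example, counts each pair once by keeping only properly paired, first-in-pair, primary alignments. The name `observed_insert_sizes` is hypothetical, the file is assumed to be tab-delimited, and this is not how FRC_Align does it; the theoretical insert size has to come from your library prep.

```python
import statistics


def observed_insert_sizes(sam_lines):
    """Collect |TLEN| once per properly paired read pair."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag, tlen = int(fields[1]), int(fields[8])
        properly_paired = flag & 0x2
        first_in_pair = flag & 0x40
        secondary_or_supplementary = flag & 0x900
        if (properly_paired and first_in_pair
                and not secondary_or_supplementary and tlen != 0):
            sizes.append(abs(tlen))
    return sizes


def summarise(sizes):
    """Return (median, mean) of the observed insert sizes."""
    return statistics.median(sizes), statistics.mean(sizes)
```

Comparing the median from `summarise` per scaffold (or per window along a scaffold) against the expected library insert size is one simple way to spot regions worth a closer look.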
Answered by Bastian Schiffthaler on November 5, 2020