Bioinformatics Asked on November 10, 2021
I have around one million short sequences (21 bp to several 100s of basepairs) for which I need to identify all occurrences of in 20-30x coverage noisy long reads (both pacbio and ONT). All of the short sequences and long reads are derived from the same individual, and there are enough short sequences given the organism’s genome size that each long read should have at least one or multiple short sequences.
Given three short sequences, AA
& BBBB
& CCCCCC
, how to identify all occurrences of these sequences in long, uncorrected reads.
Sequences to identify in long reads:
- AA
- BBBB
- CCCCCC
Long Reads:
- 1: ------AA-----------------------CCCCCC------
- 2: ------AA--------BBBB-----------------------
- 3: ---BBBB-----------CCCC--
- 4: ------------------AA-----
In this example, the output sam file will contain high MAPQ hits for short sequence AA
to reads (1, 2, 3), for short sequence BBBB
to reads (2, 3). For short sequence CCCCCC
there will be a high MAPQ hit to read 1 and a lower MAPQ hit to read 3
bwa mem
does not output multiply-mapped in a predictable manner that fits this use case https://www.biostars.org/p/304614/blat
seems to output many hits for a single short sequence, but there are no out-of-the-box parameters for mapping short reads to long reads.LAST has given the best results for me when I've tried to do this, although I agree with @user172818 that it's not a good idea to map really short reads. This is due to a combination of natural sequence duplication in long DNA sequences (e.g. see here), as well as abundant base calling differences present in single-molecule sequencing.
Minimising error is not necessarily going to be the best option, and concentrating only on unique hits will miss a lot of signal. There are frequently multiple identically plausible positions, even at zero error, and the long reads will have their own associated errors.
I also find the limitations a little odd. Canu can do read-level correction on long reads based on overlapping reads, and if assembly is possible, then nanopolish can be used to correct nanopore reads for systematic base-calling error introduced by the methylation of unamplified template.
Answered by gringer on November 10, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP