Asked by gringer on February 11, 2021
After discovering a few difficulties with genome assembly, I've taken an interest in finding and categorising repetitive DNA sequences, such as this one from Nippostrongylus brasiliensis [image: each base is colour-coded as A: green; C: blue; G: yellow; T: red].
[FASTA file associated with this sequence can be found here]
These sequences with large repeat unit sizes are only detectable (and assemblable) using long reads (e.g. PacBio, nanopore), because any subsequence shorter than the unit length cannot distinguish between a sequencing error and a hit to a different copy within the repeat structure. I have been tracking these sequences down in a bulk fashion by two methods:
After I’ve found a suspicious sequence, I then want to be able to categorise the repeat (e.g. major repeat length, number of tandem repeats, repetitive sequence). This is where I’m getting stuck.
For doing a “look, shiny” demonstration, I currently have a very manual process of getting these sequences into a format that I can visualise. My process is as follows:
use fold -w <width> and less -S to visually inspect the sequence at various potential repeat unit widths, to find the most likely repeat unit size.

But that process is by no means feasible when I've got thousands of potential repetitive sequences to fish through.
Is there any better way to do this? Given an arbitrary DNA sequence of length >10kb, how can I (in an automated fashion) find both the location of the repeat region, and also the unit length (bearing in mind that there might be multiple repeat structures, with unit lengths from 30bp to 10kb)?
An example sequence can be found here, which has a ~21kb repeat region with ~171bp repeat units about 1/3 of the way into the sequence.
I've now seen human sequences with repetitive regions in excess of 10kb (i.e. out of the range of most linked-read applications). My current idea is centred around creating hash tables of short sequences (currently 13-mers) and tracking their locations; a sketch of the idea follows.
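A minimal sketch of the idea in Python (my own illustration rather than the exact implementation; apart from k = 13, which comes from the text above, the numbered steps and all other details are assumptions):

    def kmer_periods(seq, k=13):
        """Return (position, distance-to-previous-occurrence) pairs for seq."""
        last_seen = {}  # step 1: hash table mapping k-mer -> last location seen
        periods = []
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in last_seen:
                # step 2: the k-mer has been seen before, so we are probably
                # inside a repeat structure
                # step 3: take the distance back to the previous occurrence as
                # a candidate repeat unit length; note that a k-mer duplicated
                # *within* a single unit makes this distance too small
                periods.append((i, i - last_seen[kmer]))
            last_seen[kmer] = i  # always update to the most recent location
        return periods

Binning the reported distances (e.g. with collections.Counter) then gives the dominant repeat unit length for the whole sequence.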
Some local repetitive regions may be lost in the statistics with this approach; it's hard to tell if there are multiple repetitive regions within a single sequence; and if the repeat units are themselves slightly repetitive (enough that a k-mer is duplicated within a repeat unit), then the algorithm will under-report the repeat unit length (see step 3 in the sketch above).
It could be an idea to fragment the long reads into small sequences (simulating, say, 150 bp Illumina reads), then map these fragments back against the original long reads and extract the regions with unusually high coverage; a sketch of this follows.
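A minimal sketch of that approach, assuming Python with the mappy bindings to minimap2; the file name long_reads.fa, the 150 bp fragment / 50 bp step tiling, and the depth cutoff of 3 are all placeholder assumptions:

    from collections import defaultdict
    import mappy as mp

    FRAG, STEP = 150, 50  # simulated short-read length and tiling step

    aligner = mp.Aligner("long_reads.fa", preset="sr")  # index the long reads
    if not aligner:
        raise RuntimeError("failed to build minimap2 index")

    depth = defaultdict(lambda: defaultdict(int))  # read -> position -> hits

    for name, seq, _ in mp.fastx_read("long_reads.fa"):
        for start in range(0, len(seq) - FRAG + 1, STEP):
            for hit in aligner.map(seq[start:start + FRAG]):
                # ignore the trivial self-hit of a fragment near its own origin
                if hit.ctg == name and abs(hit.r_st - start) < FRAG:
                    continue
                for pos in range(hit.r_st, hit.r_en):
                    depth[hit.ctg][pos] += 1

    # report the span of positions on each read with excess fragment coverage
    for read, cov in depth.items():
        hot = sorted(p for p, d in cov.items() if d >= 3)
        if hot:
            print(read, hot[0], hot[-1], sep="\t")

In practice one would probably merge adjacent high-depth positions into intervals and filter on interval length, but the high-depth spans already point at candidate repeat regions.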
Answered by Sergio Arredondo on February 11, 2021
I've sorted the visualisation out. Here are three alternative representations of repetitive structures for the same sequence: a dot plot, a repeat profile, and a semicircular plot [images not reproduced here].
These were generated using the same R script, callable from the command line:
$ repaver.r -style dotplot -prefix dotplot MAOA_chrX_43645715-43756016.fa
$ repaver.r -style profile -prefix profile MAOA_chrX_43645715-43756016.fa
$ repaver.r -style semicircular -prefix semi MAOA_chrX_43645715-43756016.fa
More details about this are in the presentation I gave at Queenstown Research Week, 2018. I also wrote a chapter in a peer-reviewed eBook here.
This is fast enough that I can run it on the nanopore C. elegans genome in about half an hour, producing these plots for each contig. I don't quite have a method to iterate through the plots and pick out the dominant repeat length at each location, but that's a [relatively] simple extension of what I've already done (one possible approach is sketched below).
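For example, a hedged sketch in Python: given per-position (position, candidate period) pairs, such as those produced by the kmer_periods sketch in the question above, take the modal period in sliding windows (the window size, step, and support cutoff here are arbitrary assumptions):

    from collections import Counter

    def dominant_periods(periods, window=1000, step=500, min_support=20):
        """Call the modal repeat period in sliding windows along a sequence.

        periods: iterable of (position, period) pairs, e.g. from kmer_periods().
        Returns (window_start, period, support) for windows with enough hits.
        """
        periods = sorted(periods)
        if not periods:
            return []
        end = periods[-1][0]
        calls = []
        for w_start in range(0, end + 1, step):
            in_win = [p for pos, p in periods
                      if w_start <= pos < w_start + window]
            if len(in_win) >= min_support:
                period, support = Counter(in_win).most_common(1)[0]
                calls.append((w_start, period, support))
        return calls

Windows without enough recurring k-mers are skipped, so non-repetitive stretches simply produce no call.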
With a lot of optimisations for speed and memory consumption, I've now been able to run it on the full human genome. It takes a couple of days on my desktop computer (64GB RAM + SSD swap) to categorise the 100bp repeat structure of the assembled T2T/CHM13-v1.0 chromosomes.
Code here.
Answered by gringer on February 11, 2021