Bioinformatics Asked by Stanley Ho on August 22, 2021
I have bacterial chromosomes/plasmids that I have downloaded off refseq in a multi-fasta that I would like to fragment into non-overlapping fragments of between 1 kb and 15 kb.
I have tried googling for related tools but to no avail. Does anyone know how I would achieve this? Something very similar is done in this paper:
https://link.springer.com/article/10.1007/s40484-019-0187-4
Thanks in advance!
You could use numpy.random.randint
to sample integers from a uniform distribution, from 1000 to 15000 nt, extracting subsequences of that length from a linearized multi-FASTA file.
Here's some code:
#!/usr/bin/env python
import sys
import numpy as np
low = int(sys.argv[1])
high = int(sys.argv[2])
def process_record(header, sequence, low, high):
next = 0
fragment = 1
while next < len(sequence):
new_next = np.random.randint(low, high) + next
if new_next > len(sequence) - high:
new_next = len(sequence)
sys.stdout.write('{}:{}n{}n'.format(header, fragment, sequence[next:new_next]))
next = new_next
fragment += 1
header = ""
sequence = ""
for line in sys.stdin:
if line.startswith('>'):
if len(sequence) > 0:
process_record(header, sequence, low, high)
sequence = ""
header = line.rstrip()
else:
sequence += line.rstrip()
if len(sequence) > 0:
process_record(header, sequence, low, high)
To use it:
$ ./13879.py 1000 15000 < in.fa > out.fa
Fragments will be disjoint (non-overlapping, contiguous) subsequences.
I add a check to determine if the last fragment of a FASTA record would be shorter than low
(1kb) in length, and, if so, append that fragment to the final subsequence. It is possible for that last fragment to be up to 16 kb - 1 nt in length, given your parameters:
if new_next > len(sequence) - high:
new_next = len(sequence)
You could change this logic to discard the last fragment (i.e. break
), or simply allow fragments less than 1kb in size, depending on what you need.
EDIT
Here's a slight modification that should allow to you preserve fragment sizes within your bounds.
A cartoon may help illustrate the problem. Let's set up a FASTA record, which has been broken up into n
fragments:
1 2 n-2 n-1 n
|-----|---------|...|-------|---------|-----|
We may get unlucky and sample a random integer for fragment n-1
(frag(n-1)
) which makes the length of frag(n)
less than low
(1 kb in your example).
In my first suggested solution, frag(n-1)
and frag(n)
are appended to one another. This is why you get some fragments longer than high
(15 kb).
Another option is to discard frag(n)
, which is of length less than low
(1 kb). Here's some code:
#!/usr/bin/env python
import sys
import numpy as np
import timeit
low = int(sys.argv[1])
high = int(sys.argv[2])
in_fn = sys.argv[3]
"""
Globals
"""
fragment_lengths = []
"""
Process a record
"""
def process_record(header, sequence, low, high):
next = 0
last = 0
fragment = 1
fragment_sequence = ""
while last < len(sequence) - high:
next = np.random.randint(low, high + 1) + last
fragment_sequence += sequence[last:next]
# Comment the sys.stdout.write and uncomment the fragment_lengths line, if measuring fragment length distributions
sys.stdout.write('{}:{}-{}:{}n{}n'.format(header, last, next, fragment, sequence[last:next]))
#fragment_lengths.append(next - last)
last = next
fragment += 1
"""
Uncomment to compare input sequence with fragments
"""
#sys.stderr.write('In > {}nOut > {}n'.format(sequence, fragment_sequence))
"""
Parse records
"""
def parse_records(in_fn, low, high):
header = ""
sequence = ""
with open(in_fn, 'r') as ifh:
for line in ifh:
if line.startswith('>'):
if len(sequence) > 0:
process_record(header, sequence, low, high)
sequence = ""
header = line.rstrip()
else:
sequence += line.rstrip()
if len(sequence) > 0:
process_record(header, sequence, low, high)
def count_elements(seq):
hist = {}
for i in seq:
hist[i] = hist.get(i, 0) + 1
return hist
def wrapper(func, *args, **kwargs):
def wrapped():
return func(*args, **kwargs)
return wrapped
def main(in_fn, low, high):
"""
Comment below if testing the distribution of fragments
"""
parse_records(in_fn, low, high)
"""
Uncomment to test the distribution of fragments from
running fns on same input 100k times
"""
#wrapped = wrapper(parse_records, in_fn, low, high)
#timeit.timeit(wrapped, number=100000)
#print(count_elements(fragment_lengths))
if __name__ == "__main__":
main(in_fn, low, high)
If frag(n)
is needed, and you need to preserve the uniform distribution (or whatever fragment length distribution you want to model), then rejection sampling might be an option.
With rejection sampling, you keep fragmenting your input sequence, until one of the distributions of lengths gets you a set of fragments that entirely cover the input sequence.
If you need that, follow up and I'll take another look.
Correct answer by Alex Reynolds on August 22, 2021
I've previously written my own script to fragment DNA sequences (input as either fastq or fasta files) into identical-length fragments, with a given overlap. I used this because there was a particular assembler that demanded equal-length sequences, and I was working with nanopore reads, but it's since been useful for other purposes as well.
Because it is most concerned about making lengths identical, there may be some forced overlap when setting the overlap to zero. The overlap will only be exactly zero when the sequence length is an exact multiple of the fragment size. Here's an example use:
normalise_seqlengths.pl -fraglength 25 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 6 fragments, 0 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGT
+
FFFFFFFFFFFFFFFFF:,FFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:25-50 1:N:0:CGCAAGAA+ACTCCTAA
GGGAGCTGTCTGGGGTAACCTGGCT
+
:FFFF:F:FFFFF::FFFFF,::FF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:50-75 1:N:0:CGCAAGAA+ACTCCTAA
CTAGGACCAGGTGGAAAAAGCACCT
+
F:FFFFFFFFFFFFFFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#3:75-100 1:N:0:CGCAAGAA+ACTCCTAA
TGGCTGGCCCAGCCCTGTTCTTCTC
+
:FF,FFFFF,:,FFF::F:FFF,:F
@A00187:180:HLGNGDSXX:3:1101:8486:1031#4:100-125 1:N:0:CGCAAGAA+ACTCCTAA
CTGGGGAGGGAAGGGCAGGTCCCCT
+
F:FFFFFFFFF,FFFFF:FFF,F,,
@A00187:180:HLGNGDSXX:3:1101:8486:1031#5:125-150 1:N:0:CGCAAGAA+ACTCCTAA
AATGCCTGAACCACATGGTGGGCCA
+
FFF,F:FFFFFFF:F,,,FFF,F,,
normalise_seqlengths.pl -fraglength 63 -overlap 0 oneRead.fq
@A00187:180:HLGNGDSXX:3:1101:8486:1031#0 [150 bp, 3 fragments, 20 overlap] 1:N:0:CGCAAGAA+ACTCCTAA
ATTGCGCAATGACAGCCAGGGGAGTGGGAGCTGTCTGGGGTAACCTGGCTCTAGGACCAGGTG
+
FFFFFFFFFFFFFFFFF:,FFFFFF:FFFF:F:FFFFF::FFFFF,::FFF:FFFFFFFFFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#1:43-106 1:N:0:CGCAAGAA+ACTCCTAA
CCTGGCTCTAGGACCAGGTGGAAAAAGCACCTTGGCTGGCCCAGCCCTGTTCTTCTCCTGGGG
+
FF,::FFF:FFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFF,:,FFF::F:FFF,:FF:FFFF
@A00187:180:HLGNGDSXX:3:1101:8486:1031#2:87-150 1:N:0:CGCAAGAA+ACTCCTAA
CCCTGTTCTTCTCCTGGGGAGGGAAGGGCAGGTCCCCTAATGCCTGAACCACATGGTGGGCCA
+
FFF::F:FFF,:FF:FFFFFFFFF,FFFFF:FFF,F,,FFF,F:FFFFFFF:F,,,FFF,F,,
Answered by gringer on August 22, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP