Bioinformatics Asked on March 31, 2021
I have around 10,000 FASTA files of Influenza A virus.
These files contains sequences of each of the 8 segments of the viral genome and I want to separate these files into different locations based on the content of these FASTA files.
In each FASTA file for each segment,the first line has the segment number, for example.
KM368312.1 Influenza A virus (A/swine/Shandong/01/2009(H1N1)) segment 3 polymerase PA (PA) and PA-X protein (PA-X) genes, complete cds
To be clear I want
I want to ensure all segment 1 sequences are grouped into one folder, and each file is grouped according to its geographic origin. The geographic groupings are mirrored for all 8 segements and each placed into their own directory.
The general idea is:
segment
bitIn Python, this can be done in many ways. Here’s one way:
import os
import re
def find_segment(id):
return re.search(r'segment (d)', id).group(1)
for segment in range(8):
os.mkdir(f'segment-{segment + 1}')
for entry in os.scandir():
if not entry.is_file():
continue
with open(entry.path, 'r') as file:
segment = find_segment(next(file))
os.rename(entry.path, f'segment-{segment}/{entry.path}')
Note the absence of error handling — this code assumes that every operation succeeds; empty files, or files without “segment X
” in their FASTA ID, will trigger an error, as will the usual IO failures (missing permissions, etc.).
But most people would probably use a shell scripting language instead of Python to solve this. For example, here’s something equivalent in Bash (untested):
mkdir segment-{1..8}
for file in *.fa *.fasta; do
segment=$(sed 's/.*segment ([[:digit:]]).*/1/; q' "$file")
if [[ -n $segment ]]; then
mv "$file" "segment-$segment"
done
done
Correct answer by Konrad Rudolph on March 31, 2021
One "out there" approach you could take is to create a fasta file containing your regions of interest (as separate sequences), and then map your fasta files to that reference fasta file, generating BAM/SAM output. After that, position-sort the BAM file, and you've now got a content-sorted list of sequences.
The process to do this would be something like this:
minimap2 -a regions.fasta input1.fasta input2.fasta ... | samtools sort > mapped.bam
samtools fastq mapped.bam > sorted.fasta
Answered by gringer on March 31, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP