How to sort multiple FASTA files based on their content?

Question

I have around 10,000 FASTA files of Influenza A virus.
These files contains sequences of each of the 8 segments of the viral genome and I want to separate these files into different locations based on the content of these FASTA files.
In each FASTA file for each segment,the first line has the segment number, for example.

KM368312.1 Influenza A virus (A/swine/Shandong/01/2009(H1N1)) segment 3 polymerase PA (PA) and PA-X protein (PA-X) genes, complete cds

To be clear I want

file1.txt has segment 1
file2.txt has segment 2
etc ...

I want to ensure all segment 1 sequences are grouped into one folder, and each file is grouped according to its geographic origin. The geographic groupings are mirrored for all 8 segements and each placed into their own directory.

Konrad Rudolph · Accepted Answer

The general idea is:

iterate over all files

read the first line of each file, extract the segment bit
move the file based on that

In Python, this can be done in many ways. Here’s one way:
import os
import re

def find_segment(id):
    return re.search(r'segment (d)', id).group(1)

for segment in range(8):
    os.mkdir(f'segment-{segment + 1}')

for entry in os.scandir():
    if not entry.is_file():
        continue

with open(entry.path, 'r') as file:
        segment = find_segment(next(file))
    os.rename(entry.path, f'segment-{segment}/{entry.path}')

Note the absence of error handling — this code assumes that every operation succeeds; empty files, or files without “segment X” in their FASTA ID, will trigger an error, as will the usual IO failures (missing permissions, etc.).

But most people would probably use a shell scripting language instead of Python to solve this. For example, here’s something equivalent in Bash (untested):
mkdir segment-{1..8}

for file in *.fa *.fasta; do
    segment=$(sed 's/.*segment ([[:digit:]]).*/1/; q' "$file")
    if [[ -n $segment ]]; then
    mv "$file" "segment-$segment"
    done
done

gringer · Answer

One "out there" approach you could take is to create a fasta file containing your regions of interest (as separate sequences), and then map your fasta files to that reference fasta file, generating BAM/SAM output. After that, position-sort the BAM file, and you've now got a content-sorted list of sequences.
The process to do this would be something like this:
minimap2 -a regions.fasta input1.fasta input2.fasta ... | samtools sort > mapped.bam
samtools fastq mapped.bam > sorted.fasta

How to sort multiple FASTA files based on their content?

2 Answers

Add your own answers!

Ask a Question