How to sort multiple FASTA files based on their content?

Bioinformatics Asked on March 31, 2021

I have around 10,000 FASTA files of Influenza A virus.

These files contains sequences of each of the 8 segments of the viral genome and I want to separate these files into different locations based on the content of these FASTA files.

In each FASTA file for each segment,the first line has the segment number, for example.

KM368312.1 Influenza A virus (A/swine/Shandong/01/2009(H1N1)) segment 3 polymerase PA (PA) and PA-X protein (PA-X) genes, complete cds

To be clear I want

  • file1.txt has segment 1
  • file2.txt has segment 2
    etc …

I want to ensure all segment 1 sequences are grouped into one folder, and each file is grouped according to its geographic origin. The geographic groupings are mirrored for all 8 segements and each placed into their own directory.

2 Answers

The general idea is:

  1. iterate over all files
    1. read the first line of each file, extract the segment bit
    2. move the file based on that

In Python, this can be done in many ways. Here’s one way:

import os
import re

def find_segment(id):
    return'segment (d)', id).group(1)

for segment in range(8):
    os.mkdir(f'segment-{segment + 1}')

for entry in os.scandir():
    if not entry.is_file():

    with open(entry.path, 'r') as file:
        segment = find_segment(next(file))
    os.rename(entry.path, f'segment-{segment}/{entry.path}')

Note the absence of error handling — this code assumes that every operation succeeds; empty files, or files without “segment X” in their FASTA ID, will trigger an error, as will the usual IO failures (missing permissions, etc.).

But most people would probably use a shell scripting language instead of Python to solve this. For example, here’s something equivalent in Bash (untested):

mkdir segment-{1..8}

for file in *.fa *.fasta; do
    segment=$(sed 's/.*segment ([[:digit:]]).*/1/; q' "$file")
    if [[ -n $segment ]]; then
    mv "$file" "segment-$segment"

Correct answer by Konrad Rudolph on March 31, 2021

One "out there" approach you could take is to create a fasta file containing your regions of interest (as separate sequences), and then map your fasta files to that reference fasta file, generating BAM/SAM output. After that, position-sort the BAM file, and you've now got a content-sorted list of sequences.

The process to do this would be something like this:

minimap2 -a regions.fasta input1.fasta input2.fasta ... | samtools sort > mapped.bam
samtools fastq mapped.bam > sorted.fasta

Answered by gringer on March 31, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP