Bioinformatics Asked on September 4, 2021
I have a fasta file containing the sequence of a gene across different species. In total there are around 900 samples and 12 species. (Each sequences is over multiple lines and longer than 100bp.)
My fasta file looks like:
>Species-1-samplenameA
CTATCCTTAAACGCATATCTCGCACAGTAACTCCCCAATATGTGAGCATCTGATGTTGCCCGGGCCGAGTTAGTCTTGTGCTCACGGAACTTATTGTATG
>Species-2-samplenameB
AGTAGTGATTTGAAAGAGTTGTCAGTTAGCTCGTTCAGGTAATGGTTCCTCACACTACGTCAAAATAAGAGAGCGGTCGTGACATTATCCGTGATTTTCT
>Species-3-samplenameC
CACTACTATCAGTACTCACGACTCGATTCTGCCGCAGCCACGTATCGCCAGAAAGCCAGTCAGCATTAAGGAGTGCTCTGGGCAGGACAACTCGCATAGT
>Species-3-samplenameD
GAGAGTTACATGTTCGTTGGGCTCTTCCGACACGAACCTCAGTTGGCCTACATCCTACCTGAGGTCTGTGCCCCGGTGGTGAGAAGTGCGCATTTCGTTC
I want to split this file in one fasta file per species.
I think it’s possible to use the awk function for this but I’m stuck. Does anyone have a script/code that might help me?
Thanks a lot.
A simple Biopython solution- iterate over the sequence records, identify the species, open the file handle using append mode to ensure no data is overwritten, and write the record:
from Bio import SeqIO
for record in SeqIO.parse("myfile.fa", "fasta"):
species = record.id.split('_')[0]
with open(f"{species}.fa", "a") as f:
SeqIO.write(record, f, "fasta")
Correct answer by Chris_Rands on September 4, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP