Bioinformatics Asked by Samer Habash on March 27, 2021
I have a fasta file (gene.fasta)with sequences with names patterns:
>Hsc_gene_9627.t1
ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACAACATCTGCTGTGCAGCCGCTTCA
TTGCTCATTTTGAATGCCGTTACATTGGAGAAGGATGCTAATGATTATGCCGTTGGCGAT
and I have the Ids in text file (seq.txt) which are not the exact as in the fasta file:
HSC_gene_996
HSC_gene_9734
and some of the names came as
HSC_gene_996|HSC_gene_9734
How can extract the sequences?
I am not experienced in python so please use python for dummies language 🙂
Thank you in advance
For example, the fasta are like these:
gene.fasta:
>Hsc_gene_9627.t1
ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACA
ACATCTGCTGTGCAGCCGCTTCA
>Hsc_gene_9627.t2
GGGGTTTTCCCC
>Hsc_gene_962.t1
AGTCGTCAGTCAGTAGTCGC
seq.txt:
HSC_gene_9627
HSC_gene_9734
HSC_gene_996|HSC_gene_9734
We use a few modules, read in the fasta:
import re
from Bio import SeqIO
records = list(SeqIO.parse("gene.fasta","fasta"))
Read in the list:
seq = [i.rstrip().lower() for i in open('seq.txt').readlines()]
Here I use two functions, one to write fasta files, and the other to format the header for comparison:
def fasta_out(rec):
return(">"+str(rec.description)+"n"+str(rec.seq))
def format(rec):
return re.sub("[.][^ ]*$","",rec.description).lower()
Define the records to keep:
keep = [fasta_out(i) for i in records if format(i) in seq]
Write them out:
f=open('matched.fasta','w')
f.writelines("n".join(keep))
f.close()
We can check:
open('matched.fasta').readlines()
Out[50]:
['>Hsc_gene_9627.t1n',
'ATGGCACGCATTTTCCTCATTCTTTTATTGCTGCACAACATCTGCTGTGCAGCCGCTTCAn',
'>Hsc_gene_9627.t2n',
'GGGGTTTTCCCC']
Most likely there are some simple while, for loops that does not require modules..
Answered by StupidWolf on March 27, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP