Bioinformatics Asked by Edwardo on May 29, 2021
I am working on a project that uses a FASTA file. I am writing my script in nano on the command line and running it with Python, also from the command line.
I would like the script to produce a tab-delimited file with three columns: the first should contain the sequence name, the second the sequence length, and the third the sequence itself.
This is what I have written so far in nano:
from Bio import SeqIO
import sys
for hello_fasta in SeqIO.parse(sys.argv[1], "fasta"):
    list = hello_fasta.split("\t")
    print hello_fasta.description
    print (len(hello_fasta.seq))
For example, I would like the output to have its columns in the following order: Gene name ; Gene length ; Gene seq
H0192X 26 FORUWOHRPPTRWFAWWEAKJNFWEJ
You can use a list and insert() to add an element in a specific order, then expand the list with *. Or you can use join().
from Bio import SeqIO
import sys
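for hello_fasta in SeqIO.parse(sys.argv[1], "fasta"):
    sequences = []
    sequences.insert(0, hello_fasta.description)
    sequences.insert(1, len(hello_fasta.seq))
    sequences.insert(2, hello_fasta.seq)
    # option 1: unpack the list and let print() separate the items with tabs
    print(*sequences, sep='\t')
    # option 2: convert every item to str and join with tabs
    # (both options print the same row, so keep only one of them)
    print('\t'.join(map(str, sequences)))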
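If the rows should go to a file rather than the terminal, the standard-library csv module can do the tab joining. The following is only a minimal sketch: the helper name write_fasta_table is hypothetical, and it assumes the records have already been reduced to (name, sequence) string pairs, for example from record.description and str(record.seq).

```python
import csv
import io


def write_fasta_table(rows, handle):
    """Write (name, sequence) pairs as name<TAB>length<TAB>sequence lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for name, sequence in rows:
        writer.writerow([name, len(sequence), sequence])


# Demonstration with the record from the question. With Biopython the rows
# could instead come from a generator such as
#   ((r.description, str(r.seq)) for r in SeqIO.parse(path, "fasta"))
buffer = io.StringIO()
write_fasta_table([("H0192X", "FORUWOHRPPTRWFAWWEAKJNFWEJ")], buffer)
print(buffer.getvalue(), end="")
```

Passing an open file (for example one opened with open("gene_info.tsv", "w", newline="")) instead of the StringIO buffer would write the same rows to disk.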
Answered by zorbax on May 29, 2021
Here's a solution using pandas if you want to save the tsv:
from Bio import SeqIO
import pandas as pd
from io import StringIO
example = """
>seq0
FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1
KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2
EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3
MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>seq4
EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
>seq5
SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
>seq6
FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
>seq7
SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
>seq8
SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq9
KNWEDFEIAAENMYMANPQNCRYTMKYVHSKGHILLKMSDNVKCVQYRAENMPDLKK
>seq10
FDSWDEFVSKSVELFRNHPDTTRYVVKYRHCEGKLVLKVTDNHECLKFKTDQAQDAKKMEK
"""
# This example just happens to be a string, just load your
# fasta file using the method you're already using
example_records = SeqIO.parse( StringIO(example), 'fasta')
# Dictionary to hold the data you eventually want in the tsv
data = {"Gene name" : list(),
        "Gene length" : list(),
        "Gene seq" : list()}
# Append the necessary values into the data dictionary
for record in example_records:
    data['Gene name'].append(record.description)
    data['Gene length'].append(len(record.seq))
    data['Gene seq'].append(str(record.seq))
# Convert your data into a pandas DataFrame and save as a tsv
gene_df = pd.DataFrame(data)
gene_df.to_csv("gene_info.tsv", sep = '\t', index = False)
This results in a tsv that looks like this:
$ head gene_info.tsv
Gene name Gene length Gene seq
seq0 62 FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
seq1 106 KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLMELKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
seq2 67 EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
seq3 58 MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
seq4 62 EEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVVSYEMRLFGVQKDNFALEHSLL
seq5 66 SWEEFAKAAEVLYLEDPMKCRMCTKYRHVDHKLVVKLTDNHTVLKYVTDMAQDVKKIEKLTTLLMR
seq6 70 FTNWEEFAKAAERLHSANPEKCRFVTKYNHTKGELVLKLTDDVVCLQYSTNQLQDVKKLEKLSSTLLRSI
seq7 65 SWEEFVERSVQLFRGDPNATRYVMKYRHCEGKLVLKVTDDRECLKFKTDQAQDAKKMEKLNNIFF
seq8 68 SWDEFVDRSVQLFRADPESTRYVMKYRHCDGKLVLKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
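As a possible variation, the length column can be derived inside pandas from the sequence column instead of being collected in the loop. This is a sketch under assumptions, not the method used above: the two (name, sequence) pairs are typed in by hand for illustration, and in practice they would come from the parsed records.

```python
import pandas as pd

# Illustrative (name, sequence) pairs; normally built from the FASTA records.
rows = [
    ("H0192X", "FORUWOHRPPTRWFAWWEAKJNFWEJ"),
    ("seq3", "MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK"),
]
gene_df = pd.DataFrame(rows, columns=["Gene name", "Gene seq"])

# Compute each length from the sequence string and place it as the second column.
gene_df.insert(1, "Gene length", gene_df["Gene seq"].str.len())

# With no path given, to_csv returns the tab-separated text as a string.
tsv_text = gene_df.to_csv(sep="\t", index=False)
print(tsv_text, end="")
```

Deriving the length from the stored sequence means the two columns cannot drift apart if the sequences are edited later.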
Hopefully this helps!
Answered by Robert Link on May 29, 2021