Bioinformatics Asked by mzzx on April 11, 2021
I’m new to Biopython and I’d like to extract the sequence of residues from a pdb file.
My two questions are:
So far, I’ve obtained the residue sequence via:
p = PDBParser()
structure = p.get_structure("1ppi", "1ppi.pdb")
ppb=PPBuilder()
for pp in ppb.build_peptides(structure):
print(pp.get_sequence())
seq = pp.get_sequence().__str__()
This seems to work well for this molecule. However, is there an easier way, especially when there is more than one sequence?
For example, I’ve read that one can also do
res_list = Bio.PDB.Selection.unfold_entities(structure, 'R')
but res_list is not a sequence of residues in str, and I don’t know how to convert the output from res_list to a str sequence.
In addition (or perhaps because I’m going through the PPBuilder), I’ve recently gotten a lot of warnings of the type:
/usr/local/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line..
(For example, with structure = p.get_structure("5owu", "5owu.pdb")
)
I’ve seen a discussion about how to silence such warnings, but should I be worried about this? I’ve also noticed that when I get these warnings, pp builder seems to give me more sequences then are there.
There are other ways to do it (many and some might be easier), but if you want to do it with BioPython PDB module you can iterate the residues like this:
# You can use a dict to convert three letter code to one letter code
d3to1 = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}
# Just an example input pdb
record = '1pa2.pdb'
# run parser
parser = PDBParser(QUIET=True)
structure = parser.get_structure('struct', record)
# iterate each model, chain, and residue
# printing out the sequence for each chain
for model in structure:
for chain in model:
seq = []
for residue in chain:
seq.append(d3to1[residue.resname])
print('>some_headern',''.join(seq))
Not really, is very common, you can turn it off like this (taken from https://lists.open-bio.org/pipermail/biopython/2014-July/015371.html):
import warnings
from Bio import PDBConstructionWarning
#your code
with warnings.catch_warnings():
warnings.simplefilter('ignore', PDBConstructionWarning)
#your code which might trigger the warning
#rest of your code here
However, because this is so common, you can just use the
QUIET=True option on the PDBParser object:
from Bio.PDB.PDBParser import PDBParser
struct = PDBParser(QUIET=True).get_structure('tmp', pdbf)
Try help(PDBParser) for more, or see: http://biopython.org/DIST/docs/api/Bio.PDB.PDBParser%27.PDBParser-class.html#__init__
Correct answer by Arap on April 11, 2021
The discontinuity is normal: it is a missing stretch due to lack of density. In PyMOL you see them as dotted lines.
I am sorry to say, the easiest/safest way to get the sequences is not using the PDB files...
import requests
data = requests.get(f'https://www.ebi.ac.uk/pdbe/api/pdb/entry/molecules/{code}').json()[code.lower()]
print(data[0]['sequence'])
One gets back a lot of information: example. This will tell you when you have multiple copies of the same peptide. The only information you do not get is what the PDB numbering is relative to the canonical Uniprot for that chain, which has to be taken from SIFTS.
A side note... Biopython.PDB is one of the best BioPython packages, but I personally I much prefer PyMOL as a python module (pymol2
module) to Biopython.PDB as it is functionally more complete and straightforward even if everything operates in a stage. You might want to check it out.
with pymol2.PyMOL() as pymol:
pymol.cmd.fetch('1UBQ', 'prot')
print(pymol.cmd.get_fastastr('prot'))
Answered by Matteo Ferla on April 11, 2021
you can try PDBx Python Parser too:
http://mmcif.wwpdb.org/docs/sw-examples/python/html/index.html
it has an example showing exactly what you need
http://mmcif.wwpdb.org/docs/sw-examples/python/html/fasta.html
Answered by pippo1980 on April 11, 2021
You can also use the SeqIO module from Biopython https://biopython.org/docs/1.75/api/Bio.SeqIO.PdbIO.html
and do something like this:
from Bio import SeqIO
for record in SeqIO.parse(target, "pdb-atom"):
print(record.seq)
where target is the path to your pdb file the parser has multiple input/output functions, and pdb-atom is the one that will read out of the coordinates itself.
Answered by cminima on April 11, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP