Extract residue sequence from pdb file (in biopython, but open to recommendations)

Question

I'm new to Biopython and I'd like to extract the sequence of residues from a pdb file.
My two questions are:

What is the simplest way to do this? (Esp. when there is more than one sequence)
and
Should I be worried about "PDBConstructionWarning: WARNING: Chain B is discontinuous"?

So far, I've obtained the residue sequence via:
p = PDBParser()
structure = p.get_structure("1ppi", "1ppi.pdb")
ppb=PPBuilder()
for pp in ppb.build_peptides(structure):
     print(pp.get_sequence())

seq = pp.get_sequence().__str__()

This seems to work well for this molecule. However, is there an easier way, especially when there is more than one sequence?
For example, I've read that one can also do
res_list = Bio.PDB.Selection.unfold_entities(structure, 'R')

but res_list is not a sequence of residues in str, and I don't know how to convert the output from res_list to a str sequence.
In addition (or perhaps because I'm going through the PPBuilder), I've recently gotten a lot of warnings of the type:
/usr/local/lib/python3.8/site-packages/Bio/PDB/StructureBuilder.py:89: PDBConstructionWarning: WARNING: Chain A is discontinuous at line..
(For example, with structure = p.get_structure("5owu", "5owu.pdb"))
I've seen a discussion about how to silence such warnings, but should I be worried about this? I've also noticed that when I get these warnings, pp builder seems to give me more sequences then are there.

Arap · Accepted Answer

There are other ways to do it (many and some might be easier), but if you want to do it with BioPython PDB module you can iterate the residues like this:
 # You can use a dict to convert three letter code to one letter code
 d3to1 = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
 'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
 'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
 'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

# Just an example input pdb
 record = '1pa2.pdb'

# run parser
 parser = PDBParser(QUIET=True)
 structure = parser.get_structure('struct', record)

# iterate each model, chain, and residue
 # printing out the sequence for each chain

for model in structure:
     for chain in model:
         seq = []
         for residue in chain:
             seq.append(d3to1[residue.resname])
         print('>some_headern',''.join(seq))

Not really, is very common, you can turn it off like this (taken from https://lists.open-bio.org/pipermail/biopython/2014-July/015371.html):
 import warnings
 from Bio import PDBConstructionWarning
 #your code
 with warnings.catch_warnings():
 warnings.simplefilter('ignore', PDBConstructionWarning)
 #your code which might trigger the warning
 #rest of your code here

However, because this is so common, you can just use the
 QUIET=True option on the PDBParser object:

from Bio.PDB.PDBParser import PDBParser
 struct = PDBParser(QUIET=True).get_structure('tmp', pdbf)

Try help(PDBParser) for more, or see:
http://biopython.org/DIST/docs/api/Bio.PDB.PDBParser%27.PDBParser-class.html#__init__

Matteo Ferla · Answer

The discontinuity is normal: it is a missing stretch due to lack of density. In PyMOL you see them as dotted lines.
I am sorry to say, the easiest/safest way to get the sequences is not using the PDB files...
import requests
data = requests.get(f'https://www.ebi.ac.uk/pdbe/api/pdb/entry/molecules/{code}').json()[code.lower()]
print(data[0]['sequence'])

One gets back a lot of information: example. This will tell you when you have multiple copies of the same peptide. The only information you do not get is what the PDB numbering is relative to the canonical Uniprot for that chain, which has to be taken from SIFTS.
A side note... Biopython.PDB is one of the best BioPython packages, but I personally I much prefer PyMOL as a python module (pymol2 module) to Biopython.PDB as it is functionally more complete and straightforward even if everything operates in a stage. You might want to check it out.
with pymol2.PyMOL() as pymol:
    pymol.cmd.fetch('1UBQ', 'prot')
    print(pymol.cmd.get_fastastr('prot'))

pippo1980 · Answer

you can try PDBx Python Parser too:
http://mmcif.wwpdb.org/docs/sw-examples/python/html/index.html
it has an example showing exactly what you need
http://mmcif.wwpdb.org/docs/sw-examples/python/html/fasta.html

cminima · Answer

You can also use the SeqIO module from Biopython https://biopython.org/docs/1.75/api/Bio.SeqIO.PdbIO.html
and do something like this:
from Bio import SeqIO

for record in SeqIO.parse(target, "pdb-atom"):
    print(record.seq)

where target is the path to your pdb file
the parser has multiple input/output functions, and pdb-atom is the one that will read out of the coordinates itself.

Extract residue sequence from pdb file (in biopython, but open to recommendations)

4 Answers

Add your own answers!

Ask a Question