Bioinformatics Asked on November 13, 2020
I have a list of about 4000 PDB IDs and would like to get the actual names of the proteins (e.g. lactate dehydrogenase, cytochrome c). I tried the batch header section at the Protein Databank Download page but it refused to accept my PDB IDs in formats (xxxx or xxxx.pdb, individually or space-separated) that worked in an interactive search for the protein structure.
Any suggestions?
You can use one of the UniProt Protein APIs.
Assuming your PDB entries are in a text file, one per line, such as this example.txt
containing:
1brr
4lzm
2dyi
On the command line, you can use a short script like this to download the name for each PDB entry, if one is available.
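# Query the Proteins API for each ID and prefix every returned name with "ID<TAB>" (GNU sed expands \t to a tab)
while read -r line;
do
curl -s -X GET --header 'Accept:application/json' "https://www.ebi.ac.uk/proteins/api/proteins/pdb:$line" |
jq -r '.[].protein.recommendedName.fullName.value' |
sed "s/^/$line\t/" >> pdb_names.txt;
done < example.txt;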
You need to have curl, sed and jq installed on your system.
This gives you the following output in pdb_names.txt:
1brr Bacteriorhodopsin
4lzm Endolysin
2dyi Ribosome maturation factor RimM
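With about 4000 IDs, some lookups will probably return no name at all, and jq -r prints the literal string null when the field is missing. A minimal sketch of one way to keep a row for every ID is shown here; the helper name prefix_names and the NA placeholder are assumptions for illustration, not part of the API or of the script above.

```shell
# prefix_names ID  (hypothetical helper)
# Reads protein names from stdin, one per line, and prints "ID<TAB>name"
# for each usable name. Prints "ID<TAB>NA" when no usable name arrives.
prefix_names() {
  local id="$1" name found=0
  while IFS= read -r name; do
    # jq -r prints the literal string "null" for a missing field
    if [ -n "$name" ] && [ "$name" != "null" ]; then
      printf '%s\t%s\n' "$id" "$name"
      found=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    printf '%s\tNA\n' "$id"
  fi
}

# Usage (needs network access, curl and jq), replacing the sed step:
# while read -r id; do
#   curl -s --header 'Accept:application/json' \
#     "https://www.ebi.ac.uk/proteins/api/proteins/pdb:$id" |
#     jq -r '.[].protein.recommendedName.fullName.value' |
#     prefix_names "$id"
# done < example.txt > pdb_names.txt
```

Rows marked NA can then be listed with grep and checked by hand, since a PDB entry without a mapped UniProt record may still have a name in the PDB itself.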
Update
If you want to speed it up, you can run it with parallel.
parallel -j 4 'curl -s -X GET --header "Accept:application/json" "https://www.ebi.ac.uk/proteins/api/proteins/pdb:{}" | jq -r ".[] | .protein.recommendedName.fullName.value" | sed "s/^/{}\t/" >> pdb_names_parallel.txt' :::: example.txt
With the -j option you set how many jobs should run in parallel. The limit of the UniProt API is 200 requests per second per user.
Update 7. Nov 2020
To get information other than the protein name, you need to know what the JSON response from UniProt looks like.
To also get the scientific name, you can run the following command:
parallel -j 4 'curl -s -X GET --header "Accept:application/json" "https://www.ebi.ac.uk/proteins/api/proteins/pdb:{}" | jq -r ".[] | .protein.recommendedName.fullName.value + \" - \" + .organism.names[0].value" | sed "s/^/{}\t/" >> pdb_names_parallel.txt' :::: example.txt
As a result you get this:
1brr Bacteriorhodopsin - Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1)
4lzm Endolysin - Enterobacteria phage T4
2dyi Ribosome maturation factor RimM - Thermus thermophilus (strain HB8 / ATCC 27634 / DSM 579)
Correct answer by Mr_Z on November 13, 2020
Assuming you can use R, have you tried biomaRt? For example, using 2bhl (my PhD lover :D):
library(biomaRt)
ensembl <- useMart("ensembl",dataset="hsapiens_gene_ensembl")
# get list of all available info
filters <- listFilters(ensembl)
attributes <- listAttributes(ensembl)
getBM(attributes=c('hgnc_symbol','ensembl_gene_id','entrezgene_id',
'protein_id','description',"superfamily"),
filters = 'pdb',
values = "2bhl",
mart = ensembl)
Returns
hgnc_symbol ensembl_gene_id entrezgene_id protein_id description
1 G6PD ENSG00000160211 2539 ADO22353 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
2 G6PD ENSG00000160211 2539 CAA27309 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
3 G6PD ENSG00000160211 2539 AAA63175 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
4 G6PD ENSG00000160211 2539 AAA52500 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
5 G6PD ENSG00000160211 2539 AAA52501 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
6 G6PD ENSG00000160211 2539 CAA39089 glucose-6-phosphate dehydrogenase [Source:HGNC Symbol;Acc:HGNC:4057]
7 ...
I am sure if you look at all the available filters and attributes, you can pinpoint the ID you are looking for.
Answered by fra on November 13, 2020