Bioinformatics Asked by juniper- on April 25, 2021
Is there a paper or web page describing the procedure for creating the nr database used by NCBI’s BLAST implementation?
I presume it’s some type of clustering, but I’m curious about how exactly sequences are condensed into non-redundant representatives.
Did a little more searching and found the answer in the README on BLAST's ftp site: ftp://ftp.ncbi.nlm.nih.gov/blast/db/README
6. Non-redundant defline syntax
The non-redundant databases are nr, nt and pataa. Identical sequences are
merged into one entry in these databases. To be merged two sequences must
have identical lengths and every residue at every position must be the
same. The FASTA deflines for the different entries that belong to one
record are separated by control-A characters invisible to most
programs. In the example below both entries Q57293.1 and AAB05030.1
have the same sequence, in every respect:
>Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC ^AAAB05030.1 afuC
[Actinobacillus pleuropneumoniae] ^AAAB17216.1 afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVTKSSIQNRDIC
IVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQQQRVALARALVLKPKVLILD
EPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMNKGTIMQKARQKIFIYDRILYSLRNFMGEST
ICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPEAIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLIN
ANPDQFDPDATKAFIHFTEQGIFLLNKE
Individual sequences are now identified simply by their accession.version.
For databases whose entries are not from official NCBI sequence databases,
such as Trace database, the gnl| convention is used. For custom databases,
this convention should be followed and the id for each sequence must be
unique, if one would like to take the advantage of indexed database,
which enables specific sequence retrieval using blastdbcmd program included
in the blast executable package. One should refer to documents
distributed in the standalone BLAST package for more details.
Landed on that README from this question on biostars.org: https://www.biostars.org/p/217456/
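To make the defline syntax concrete, here is a minimal Python sketch that splits a merged nr defline back into its individual entries. It assumes the separator is the raw control-A byte (`\x01`, shown as `^A` in the README) and that each entry starts with its accession.version followed by a space. The function name `split_nr_defline` is made up for illustration and is not part of any BLAST tool.

```python
CTRL_A = "\x01"  # control-A, displayed as ^A in the README example


def split_nr_defline(defline):
    """Split a merged nr defline into (accession.version, description) pairs.

    Assumption: entries are separated by control-A and each entry begins
    with its accession.version, followed by a space and a free-text title.
    """
    entries = []
    for part in defline.lstrip(">").split(CTRL_A):
        part = part.strip()
        if not part:
            continue  # ignore empty fragments from stray separators
        accession, _, description = part.partition(" ")
        entries.append((accession, description.strip()))
    return entries


# The defline from the README example, with ^A written as \x01.
example = (
    ">Q57293.1 RecName: Full=Fe(3+) ions import ATP-binding protein FbpC "
    "\x01AAB05030.1 afuC [Actinobacillus pleuropneumoniae] "
    "\x01AAB17216.1 afuC [Actinobacillus pleuropneumoniae]"
)

for accession, description in split_nr_defline(example):
    print(accession, "|", description)
```

Each printed line corresponds to one source record that shares the single sequence stored under the merged entry.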
Edit
In that same README file is some information on the origin of the sequences in the non-redundant sets:
+-----------------------+-----------------------------------------------------+
|File Name              | Content Description                                 |
+-----------------------+-----------------------------------------------------+
nr.gz*                  | non-redundant protein sequence database with entries
                          from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq
nt.gz*                  | nucleotide sequence database, with entries from all
                          traditional divisions of GenBank, EMBL, and DDBJ;
                          excluding bulk divisions (gss, sts, pat, est, htg)
                          and wgs entries. Partially non-redundant.
Correct answer by juniper- on April 25, 2021
The RefSeq team and the NCBI resource coordinators each publish a new paper every few years, so check out their many papers. To answer your second question: non-redundancy here is (I think) defined very strictly, as proteins that are identical in both sequence and length. The clustering is therefore trivial and does not need the kind of sophisticated clustering algorithm required to detect more remote homologs.
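As a rough illustration of why this is trivial, exact-match merging can be sketched in a few lines of Python by grouping deflines under their sequence string. This is only a toy sketch of the idea, not NCBI's actual pipeline; the function name `merge_identical` and the toy records are invented for the example, and it assumes sequences are compared as exact, case-sensitive strings.

```python
def merge_identical(records):
    """Merge records whose sequences are exactly identical.

    `records` is an iterable of (defline, sequence) pairs. Two records are
    merged only if their sequence strings are equal, which implies equal
    length and the same residue at every position. Merged deflines are
    joined with control-A, mirroring the nr defline convention.
    """
    groups = {}  # sequence -> list of deflines, in first-seen order
    for defline, sequence in records:
        groups.setdefault(sequence, []).append(defline)
    return [("\x01".join(deflines), seq) for seq, deflines in groups.items()]


# Toy input with made-up identifiers and sequences.
toy_records = [
    ("a.1 protein one", "MKTAYIAK"),
    ("b.1 protein two", "MKTAYIAR"),    # differs at the last residue
    ("c.1 protein three", "MKTAYIAK"),  # identical to a.1
]

for merged_defline, seq in merge_identical(toy_records):
    print(merged_defline.split("\x01"), seq)
```

A dictionary lookup is all that is needed because the criterion is equality, whereas grouping by similarity would require pairwise alignment or a dedicated clustering tool.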
Answered by Chris_Rands on April 25, 2021