Similar to the OP of https://www.biostars.org/p/377840/, I would like to programmatically BLAST a sequence against a local database of all WGS assemblies.
Since hosting all WGS assemblies isn’t feasible for the average biology lab server (correct me if I am wrong), I plan to use ncbi-acc-download to download all WGS assemblies of the species of interest (not a popular species like E. coli, so it should be feasible). Then, I will create a BLAST database from the downloaded assemblies and BLAST the sequence against it.
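To make the plan concrete, here is a rough sketch of the three steps (download, build the database, search) as command lines assembled in Python. The helper name `build_pipeline`, the file names and the placeholder accessions are made up for illustration. The `--format` and `--out` options of ncbi-acc-download are as I understand them from its README, so check `ncbi-acc-download --help` before relying on them. The sketch also assumes a nucleotide query (hence `blastn`).

```python
import shlex


def build_pipeline(accessions, query_fasta, db_name="wgs_db"):
    """Return the three commands of the plan as argument lists (hypothetical helper)."""
    combined_fasta = f"{db_name}.fasta"
    # Step 1: download all accessions as FASTA into one file
    # (--format / --out as I understand the ncbi-acc-download README).
    download = ["ncbi-acc-download", "--format", "fasta",
                "--out", combined_fasta, *accessions]
    # Step 2: build a nucleotide BLAST database from the combined FASTA.
    make_db = ["makeblastdb", "-in", combined_fasta,
               "-dbtype", "nucl", "-out", db_name]
    # Step 3: search the query against the database, tabular output (format 6).
    blast = ["blastn", "-query", query_fasta, "-db", db_name,
             "-outfmt", "6", "-out", f"{db_name}_hits.tsv"]
    return [download, make_db, blast]


# Print the commands instead of running them; the accessions are placeholders.
for command in build_pipeline(["ACCESSION_1", "ACCESSION_2"], "query.fasta"):
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once the tools are installed. Printing first makes it easy to inspect the commands before downloading anything.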
How can I find the accessions of all WGS assemblies of a species?
My current plan is to search the NCBI Assembly database using Entrez and a search term such as
"wgs"[Properties] AND txid1337[orgn:exp].
EDIT: IIUC, this approach might miss some WGS assemblies. See my answer.
I am worried that this isn’t the right approach (hence this question), because there seem to be at least 3 other places in which assemblies can be found:
There seem to be WGS assemblies that can't be found in the NCBI Assembly database, e.g. https://www.ncbi.nlm.nih.gov/nuccore/1779902990. I guess that such assemblies are also missing from the assembly_summary.txt files described in https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt.
My current best guess is that each WGS assembly has a "WGS master record" in the NCBI Nuccore database. To find all WGS assemblies of a taxon whose UID in the NCBI Taxonomy database is 1337, search the NCBI Nuccore database using Entrez with the search term
"wgs master"[Properties] AND txid1337[orgn:exp].
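For reference, the search above could be run programmatically through the E-utilities ESearch endpoint. The following is only a sketch using the Python standard library. The helper names (`wgs_master_term`, `esearch_url`, `parse_esearch`, `search`) are mine, and `retmax=10000` is just a convenient upper bound. Note that ESearch returns Nuccore UIDs, not accessions, so a further ESummary or EFetch step would be needed to map them to accessions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def wgs_master_term(taxid):
    """Build the search term from the answer for a given NCBI Taxonomy UID."""
    return f'"wgs master"[Properties] AND txid{taxid}[orgn:exp]'


def esearch_url(db, term, retmax=10000):
    """Build an ESearch URL that asks for a JSON response."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_esearch(payload):
    """Extract the total hit count and the list of UIDs from an ESearch JSON reply."""
    result = json.loads(payload)["esearchresult"]
    return int(result["count"]), result["idlist"]


def search(db, term, retmax=10000):
    """Run the query against NCBI (needs network access)."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        return parse_esearch(response.read().decode("utf-8"))
```

Calling `search("nuccore", wgs_master_term(1337))` would return the hit count and the UIDs of the matching records. If the count exceeds `retmax`, the remaining UIDs have to be fetched in further pages using the `retstart` parameter.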
Answered by Oren Milman on December 12, 2020