Bioinformatics Asked by Mauri1313 on April 27, 2021
I have a text file that contains a list of IDs (314 sequences):
AVP78031.1
AVP78042.1
ATO98108.1
ATO98120.1
ATO98132.1
...
My goal is to write a script (maybe in Python or Perl) that checks, for each ID in the list, whether it is a nucleotide or a protein sequence.
For example:
AVP78031.1 -> protein (this is actually a nucleotide sequence; I changed it to protein just to show an example).
AVP78042.1 -> nucleotide
ATO98108.1 -> nucleotide
ATO98120.1 -> nucleotide
ATO98132.1 -> nucleotide
Any ideas on how to write such a script?
Thanks, everybody!
If these are all GenBank or RefSeq accessions, you can use Entrez Direct for this as shown below:
$ cat accs.txt
ATO98108.1
ATO98120.1
ATO98132.1
AVP78031.1
AVP78042.1
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
## no output because none of them are nucleotide accessions
$ cat accs.txt | epost -db protein -format acc | efetch -format acc
AVP78042.1
AVP78031.1
ATO98132.1
ATO98120.1
ATO98108.1
NOTE: This works only if the accessions are currently live, because epost does not find suppressed accessions or older versions that have since been replaced. For example:
$ cat accs.txt
NM_002826.3
NM_002826.4
NM_002826.5
$ cat accs.txt | epost -db nuccore -format acc | efetch -format acc
NM_002826.5
Here, all three accessions are valid nucleotide accessions, but only the latest version, NM_002826.5, is currently live.
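To get the "ID -> type" listing asked for in the question, the output of the two pipelines can be combined in a few lines of Python. This is only a rough sketch: the function names `fetch_live_accessions` and `label_accessions` are made up for illustration, the helper assumes that `epost` and `efetch` are installed and on the PATH, and it reuses exactly the flags shown in the commands above. Anything that neither database returns is labelled "not found", which covers the non-live accessions mentioned in the note.

```python
import subprocess


def fetch_live_accessions(path, db):
    """Run the epost | efetch pipeline shown above for one database.

    Hypothetical helper: assumes Entrez Direct is installed and that
    `path` holds one accession per line. Returns the set of accessions
    printed by efetch.
    """
    with open(path) as handle:
        result = subprocess.run(
            f"epost -db {db} -format acc | efetch -format acc",
            shell=True,
            stdin=handle,          # the file is fed to epost, like `cat accs.txt |`
            capture_output=True,
            text=True,
        )
    return {line.strip() for line in result.stdout.splitlines() if line.strip()}


def label_accessions(ids, nucleotide_hits, protein_hits):
    """Map each ID to 'nucleotide', 'protein', or 'not found'."""
    labels = {}
    for acc in ids:
        if acc in nucleotide_hits:
            labels[acc] = "nucleotide"
        elif acc in protein_hits:
            labels[acc] = "protein"
        else:
            # Not returned by either database, e.g. a non-live accession
            labels[acc] = "not found"
    return labels
```

A possible way to use it: read the IDs from accs.txt, call `fetch_live_accessions("accs.txt", "nuccore")` and `fetch_live_accessions("accs.txt", "protein")`, pass both sets to `label_accessions`, and print each pair as `ID -> type`.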
An alternative is to use the accession prefixes documented by NCBI and come up with an appropriate regular expression.
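A minimal sketch of the regular expression idea in Python could look like the following. The patterns are assumptions based on the commonly documented accession formats (GenBank proteins as three letters plus five or seven digits, GenBank nucleotides as one letter plus five digits or two letters plus six or eight digits, and the usual RefSeq prefixes); they are not exhaustive and leave out, for example, WGS accessions. The names `PROTEIN_PATTERNS`, `NUCLEOTIDE_PATTERNS`, and `classify` are illustrative only. Unlike the Entrez Direct approach, this works offline and does not depend on the accession being live, but it only checks the format and not whether the record exists.

```python
import re

# Simplified patterns, assumed from the commonly documented accession
# formats; extend them against the NCBI prefix list before relying on them.
PROTEIN_PATTERNS = [
    re.compile(r"^[A-Z]{3}\d{5}(\d{2})?$"),    # GenBank protein: 3 letters + 5 or 7 digits
    re.compile(r"^(NP|XP|YP|WP|AP)_\d+$"),     # RefSeq protein prefixes
]
NUCLEOTIDE_PATTERNS = [
    re.compile(r"^[A-Z]\d{5}$"),               # GenBank nucleotide: 1 letter + 5 digits
    re.compile(r"^[A-Z]{2}\d{6}(\d{2})?$"),    # GenBank nucleotide: 2 letters + 6 or 8 digits
    re.compile(r"^(NM|NR|NC|NG|NT|NW|NZ|XM|XR|AC)_[A-Z0-9]+$"),  # RefSeq nucleotide prefixes
]


def classify(accession):
    """Return 'protein', 'nucleotide', or 'unknown' based on format alone."""
    # Drop surrounding whitespace and the version suffix (e.g. ".1")
    base = re.sub(r"\.\d+$", "", accession.strip().upper())
    if any(p.match(base) for p in PROTEIN_PATTERNS):
        return "protein"
    if any(p.match(base) for p in NUCLEOTIDE_PATTERNS):
        return "nucleotide"
    return "unknown"


if __name__ == "__main__":
    for acc in ["AVP78031.1", "ATO98108.1", "NM_002826.3"]:
        print(f"{acc} -> {classify(acc)}")
```

Anything reported as "unknown" is worth checking by hand or with the Entrez Direct commands above.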
Correct answer by vkkodali on April 27, 2021