Bioinformatics Asked by Mike Flynn on April 4, 2021
I’m trying to Blast for exact matches to the sequence: ‘ATTGNNNNGCAAACCA’ in the human transcriptome using NCBI Blast on its ‘refseq_rna’ database. However, when I do a basic query I get "No significant similarity found."
Note the N’s in the middle of the sequence, if I get rid of those N’s, yielding the sequence ‘AGCGGATTGCAAAGCAAACCA’, I get a match in the 3′ UTR of the human MeCP2 gene (which is correct). I don’t understand why adding the N’s here make it not work.
I looked at the help section and I believe this may be due to shorter sequences being less "statistically significant". Statistical significance of output is completely irrelevant to me, so I set ‘Expect Value’ to 100000000000000. However, I still get the same result.
I also tried to modify the HTML of the page to allow me to submit a word size of 4 to the form. This went through, but I still got "No significant similarity found".
Could someone help me out with this? I feel like searching for a short sequence shouldn’t be this hard.
UPDATE: while procrastinating, I worked up the below throwaway code into a more generally applicable (but still bare-bones) script.
ORIGINAL:
I feel like you might want to use something other than BLAST to do this.
BLAST is not really designed to solve this problem, I think it's just easy to use. Also, your N-containing sequence (16nt) is shorter than the version without Ns (21 nt), which can't be correct if you care about "exact" matches.
As a side note, the idea of exact matches is a bit tricky with Ns involved. The whole point of ambiguity codes is that things won't be exact.
And all of this is on top of the short-string problem- most alignment programs won't even return results that don't meet some minimum alignment score, which is exacerbated a bit by the Ns. For how BLAST handles ambiguity codes, you can see section 7.2 here.
Why not use regular expressions instead to search your sequence against the transcriptome?
# python
import re
from Bio import SeqIO
fwd = r'ATTG[ACGT]{4}GCAAACCA'
rev = r'TGGTTTGC[ACGT]{4}CAAT' # reverse complement
# reference file
seqfile = SeqIO.parse("your_reference_file.fa", "fasta") # a transcriptome reference fasta
for seq in seqfile:
# maybe use re.match() to get coords but easier to show it with findall
a = re.findall(fwd, str(seq.seq))
b = re.findall(rev, str(seq.seq))
if len(a) > 0:
print(seq.id, a)
if len(b) > 0:
print(seq.id, b)
When I put this in a file and run it against a test case with ATTGAAAAGCAAACCA
, it works:
$ python exact_match.py
the_contig ['ATTGAAAAGCAAACCA']
Alternately, you could just blast the 16 non-N versions of that sequence and combine them, as you've shown.
Correct answer by Maximilian Press on April 4, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP