Bioinformatics Asked by GSQ on February 2, 2021
I have an awk
command that identifies 30 pb from two multifasta files.
When I used two input files:
awk
command and blastn
command.Question However, when I increase the number of sequences to 10,000 I get different outputs, any idea why this could happend?
Command-#sequences-matches
awk -100- 1
blastn -100- 1
awk -1000- 73
blastn -1000- 73
awk -10000- 8384
blastn -10000- 8280
awk command:
BEGIN { wid = 30 }
sub(/^>/,"") { hdr=$1; next }
NR == FNR { a[hdr]=$0; next }
{
for ( hdrA in a ) {
strA = a[hdrA]
lgthA = length(strA)
for ( idxA=1; idxA<=(lgthA - wid); idxA++ ) {
substrA = substr(strA,idxA,wid)
if ( index($0, substrA) ) {
printf "[%s,%s]n", hdrA, hdr
break
}
}
}
}
awk -f tst.awk file_with_100seqs file_with_100seqsv2
blastn -query file_with_100seqs -subject file_with_100seqsv2 -out blastn_output_seqs1vs2_100.txt -word_size 30 -perc_identity 100 -outfmt "10 qseqid sseqid"
The input files look like this:
#File1
>1
TGATTGCATAACCACTTAACATCTTGTTTTATCTAAATAAAATTAAGCATGTTATCTTTTTGGGGCACTCCTGGGGCAGTAGATGCCAGTTGTTGATTCAGTATATCTACTTGTGACTGGTTATTATCCCGATTTTTTTAGTTTTAAGGTGTTGACATAGCCATCCATGCTCCATATACTGTATAGACCATCTGAGCGTT
>2
TGGGAAAACAGCATTCAGCGGTGGCTTATTCCTGCTAAGGATGTTGGCCGCATTCATGCTGAGCACAACCTCGACGGCCTGCTGAGGGGCGATTCGGCATCCCGCGCTGCCTTTATGAAGGCAATGGGAGAGGCAGGGCTACGCACCATCAACGAGATGCGACGAACGGACAACCTCCCGCCATTGCCGGGTGGCGATGT
>3
GAAATGGGAACCGCGAACATGCCTGCACATCCGTTTGTGCGACCCGCTTACGATACTCGCGAGGAAGAGGCCGCCAGCGTCGCCATTGCCAGGATGAATCAGGCTATTGATGAGGTATTGAGCAAGTGAATGAAGATAATATCTACGCCTTGCTTTCTCCCCTGGCAGAAGGACGGGTATATCCCTATGTTGCGCCATTA
#File2
>1
TGATTGCATAACCACTTAACATCTTGTTTTATCTAAATAAAATTAAGCATGTTATCTTTTTGGGGCACTCCTGGGGCAGTAGATGCCAGTTGTTGATTCAGTATATCTACTTGTGACTGGTTATTATCCCGATTTTTTTAGTTTTAAGGTGTTGACATAGCCATCCATGCGGGAAGGTGCAGCATAATGTGCTTTGGATT
>2
TGAGTGCCCCATTTGTGAAGCAATAAAGTTCGGGTTCGCGCCAGCGGCAAGCGCCCAGCATGCACCGATTTTTTTAGTTTTAAGGTGTTGACATTAGGTATGTCGGGACTGGTATGCTTTCCTGTGTCGCAGCCCGGCGCGTCTCAATGCAGATTCCCATATCCTGTTCATCCATATACTGTATAGACCATCTGAGCGTT
>3
TACCTGAGCGATCGGTAATTTGCGGATTGAAGACAAAGGTGCAGGAATGAGTTTTTGTACGACCGTATTCGCGCAGCTTTACTTCAATTTTGTGCTGTTTGCTCAGCTTCGTGAAAGAGGCCTGACTTTTTAAAGCATCAATTGCTGGCTGCACAAGATGTATCACCCTGTCGGTTCCTGCCTGGGTTTTCGGCAGGGTG
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP