TransWikia.com

Why do I obtain different output results with blast vs awk commands

Bioinformatics Asked by GSQ on February 2, 2021

I have an awk command that identifies 30 pb from two multifasta files.

When I used two input files:

  • E.g. 100 sequences each, I get the same result with the awk command and blastn command.
  • E.g. with 1,000 sequences each I get the same result with both commands.

Question However, when I increase the number of sequences to 10,000 I get different outputs, any idea why this could happend?

Command-#sequences-matches
awk      -100-       1
blastn   -100-       1
awk     -1000-      73
blastn  -1000-      73
awk     -10000-     8384
blastn  -10000-     8280

awk command:

BEGIN { wid = 30 }

sub(/^>/,"") { hdr=$1; next }

NR == FNR { a[hdr]=$0; next }

{

    for ( hdrA in a ) {

        strA  = a[hdrA]

        lgthA = length(strA)

        for ( idxA=1; idxA<=(lgthA - wid); idxA++ ) {

            substrA = substr(strA,idxA,wid)

            if ( index($0, substrA) ) {

                printf "[%s,%s]n", hdrA, hdr

                break
            }
        }
    }
}

awk -f tst.awk file_with_100seqs file_with_100seqsv2

  1. blastn command:

blastn -query file_with_100seqs -subject file_with_100seqsv2 -out blastn_output_seqs1vs2_100.txt -word_size 30 -perc_identity 100 -outfmt "10 qseqid sseqid"

The input files look like this:

#File1

>1  
TGATTGCATAACCACTTAACATCTTGTTTTATCTAAATAAAATTAAGCATGTTATCTTTTTGGGGCACTCCTGGGGCAGTAGATGCCAGTTGTTGATTCAGTATATCTACTTGTGACTGGTTATTATCCCGATTTTTTTAGTTTTAAGGTGTTGACATAGCCATCCATGCTCCATATACTGTATAGACCATCTGAGCGTT  
>2  
TGGGAAAACAGCATTCAGCGGTGGCTTATTCCTGCTAAGGATGTTGGCCGCATTCATGCTGAGCACAACCTCGACGGCCTGCTGAGGGGCGATTCGGCATCCCGCGCTGCCTTTATGAAGGCAATGGGAGAGGCAGGGCTACGCACCATCAACGAGATGCGACGAACGGACAACCTCCCGCCATTGCCGGGTGGCGATGT  
>3  
GAAATGGGAACCGCGAACATGCCTGCACATCCGTTTGTGCGACCCGCTTACGATACTCGCGAGGAAGAGGCCGCCAGCGTCGCCATTGCCAGGATGAATCAGGCTATTGATGAGGTATTGAGCAAGTGAATGAAGATAATATCTACGCCTTGCTTTCTCCCCTGGCAGAAGGACGGGTATATCCCTATGTTGCGCCATTA

#File2

>1  
TGATTGCATAACCACTTAACATCTTGTTTTATCTAAATAAAATTAAGCATGTTATCTTTTTGGGGCACTCCTGGGGCAGTAGATGCCAGTTGTTGATTCAGTATATCTACTTGTGACTGGTTATTATCCCGATTTTTTTAGTTTTAAGGTGTTGACATAGCCATCCATGCGGGAAGGTGCAGCATAATGTGCTTTGGATT  
>2  
TGAGTGCCCCATTTGTGAAGCAATAAAGTTCGGGTTCGCGCCAGCGGCAAGCGCCCAGCATGCACCGATTTTTTTAGTTTTAAGGTGTTGACATTAGGTATGTCGGGACTGGTATGCTTTCCTGTGTCGCAGCCCGGCGCGTCTCAATGCAGATTCCCATATCCTGTTCATCCATATACTGTATAGACCATCTGAGCGTT  
>3  
TACCTGAGCGATCGGTAATTTGCGGATTGAAGACAAAGGTGCAGGAATGAGTTTTTGTACGACCGTATTCGCGCAGCTTTACTTCAATTTTGTGCTGTTTGCTCAGCTTCGTGAAAGAGGCCTGACTTTTTAAAGCATCAATTGCTGGCTGCACAAGATGTATCACCCTGTCGGTTCCTGCCTGGGTTTTCGGCAGGGTG  

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP