TransWikia.com

Extract sequences from partial Header

Bioinformatics Asked on November 24, 2021

I have a library I would like to split them based on their headers. Some are related to RNA and DNA.

The header contains much information, but the most important the presence of DNA or RNA/LTR.., these partial words could be in between or at the beginning.

The point is the knowledge of how to extract sequences from a partial header occur in between the ID

My question is how to use grep or awk to grep the header that has one of these words along with sequences?
Note that The sequences are more than one line.

Or perhaps grep the specific word and ignore what before and after?

>Tigger16a#DNA/TcMar-Tigger DF0000028 TcMar-Tigger **DNA** transposon 
>rnd-4_family-38#SINE/MIR ( Recon Family Size = 20, Final Multiple Alignmen 
>rnd-6_family-31751#LTR/Gypsy ( Recon Family Size = 26, Final Multiple Alignment Size = 22 )
>RNA2558#LTR/ERVL 
>NonDNA1#LINE/I-Jockey 
>DNA5#DNA/TcMar-Tc1

I have tried to create a list of the Required IDs using grep creating the list and extracting the sequences but for some reasons, the output has more sequences than those specified in the DNAID.txt list.

grep -A1000 -w -f DNAID.txt.fa MyLibrary > DNA_Sequence.fa

2 Answers

Here is a solution with awk:

This is the example fasta file which I used:

>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>Test 2 RNA 
ACGGT
GGTA
CGGA
>DNA - TEST 3
ACG
GATA
AGGT

You can create a file called fasta_search.awk and insert following code:

#!/usr/bin/awk -f
BEGIN {
    hit=0
}
{
    if ($0 ~ /^>/) 
    {   
        if ($0 ~ search)
        {
            hit=1;
            print $0; 
        }
        else 
        {
            hit=0; 
        }
     }   
    else if(hit==1)
    {   
        print $0
    }   
}

Now you can run: àwk -v search="DNA" -f fasta_search.awk my_fastas.fa

search is therefore a awk variable which could be any String you are looking for.

As result I get this:

>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>DNA - TEST 3
ACG
GATA
AGGT

To get all your keywords, you could run:

for i in DNA RNA LTR; do awk -v search="$i" -f fasta_search.awk my_fastas.fa > lib_$i.fasta; done

This results in head lib_*:

==> lib_DNA.fasta <==
>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>DNA - TEST 3
ACG
GATA
AGGT

==> lib_LTR.fasta <==

==> lib_RNA.fasta <==
>Test 2 RNA
ACGGT
GGTA
CGGA

Answered by Mr_Z on November 24, 2021

You could use Biopython. If its a fasta file, it will be complicated to write fasta output (especially multiline fasta) with grep or awk. Simple solution is to use biopython, so that you can even match for any complex patterns in the fasta header.

from Bio import SeqIO

rna_records  = []
dna_records = []

for seq in SeqIO.parse("in.fa","fasta"):

    if "RNA" in seq.id:
        rna_records.append(seq)

    elif "DNA" in seq.id:
        dna_records.append(seq)

SeqIO.write(rna_records, "RNA_out.fa","fasta")
SeqIO.write(dna_records, "DNA_out.fa","fasta")

This code needs to be optimised if you are dealing with large sequences like human genome fasta as it holds everything in memory.

Answered by geek_y on November 24, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP