Extract sequences from partial Header

Question

I have a library I would like to split them based on their headers. Some are related to RNA and DNA.
The header contains much information, but the most important the presence of DNA or RNA/LTR.., these partial words could be in between or at the beginning.
The point is the knowledge of how to extract sequences from a partial header occur in between the ID
My question is how to use grep or awk to grep the header that has one of these words along with sequences?
Note that The sequences are more than one line.
Or perhaps grep the specific word and ignore what before and after?
>Tigger16a#DNA/TcMar-Tigger DF0000028 TcMar-Tigger **DNA** transposon 
>rnd-4_family-38#SINE/MIR ( Recon Family Size = 20, Final Multiple Alignmen 
>rnd-6_family-31751#LTR/Gypsy ( Recon Family Size = 26, Final Multiple Alignment Size = 22 )
>RNA2558#LTR/ERVL 
>NonDNA1#LINE/I-Jockey 
>DNA5#DNA/TcMar-Tc1

I have tried to create a list of the Required IDs using grep creating the list and extracting the sequences but for some reasons, the output has more sequences than those specified in the DNAID.txt list.
grep -A1000 -w -f DNAID.txt.fa MyLibrary > DNA_Sequence.fa

Mr_Z · Answer

Here is a solution with awk: This is the example fasta file which I used: >Test DNA 1 ACGTAAGGATATAATAC ACGTA AGGATA >Test 2 RNA ACGGT GGTA CGGA >DNA - TEST 3 ACG GATA AGGT You can create a file called fasta_search.awk and insert following code: #!/usr/bin/awk -f BEGIN { hit=0 } { if ($0 ~ /^>/) { if ($0 ~ search) { hit=1; print $0; } else { hit=0; } } else if(hit==1) { print $0 } } Now you can run: àwk -v search="DNA" -f fasta_search.awk my_fastas.fa search is therefore a awk variable which could be any String you are looking for. As result I get this: >Test DNA 1 ACGTAAGGATATAATAC ACGTA AGGATA >DNA - TEST 3 ACG GATA AGGT To get all your keywords, you could run: for i in DNA RNA LTR; do awk -v search="$i" -f fasta_search.awk my_fastas.fa > lib_$i.fasta; done This results in head lib_*: ==> lib_DNA.fasta <== >Test DNA 1 ACGTAAGGATATAATAC ACGTA AGGATA >DNA - TEST 3 ACG GATA AGGT ==> lib_LTR.fasta <== ==> lib_RNA.fasta <== >Test 2 RNA ACGGT GGTA CGGA

geek_y · Answer

You could use Biopython. If its a fasta file, it will be complicated to write fasta output (especially multiline fasta) with grep or awk. Simple solution is to use biopython, so that you can even match for any complex patterns in the fasta header.
from Bio import SeqIO

rna_records  = []
dna_records = []

for seq in SeqIO.parse("in.fa","fasta"):

if "RNA" in seq.id:
        rna_records.append(seq)

elif "DNA" in seq.id:
        dna_records.append(seq)

SeqIO.write(rna_records, "RNA_out.fa","fasta")
SeqIO.write(dna_records, "DNA_out.fa","fasta")

This code needs to be optimised if you are dealing with large sequences like human genome fasta as it holds everything in memory.

Extract sequences from partial Header

2 Answers

Add your own answers!

Ask a Question