Bioinformatics Asked on November 24, 2021
I have a library I would like to split them based on their headers. Some are related to RNA and DNA.
The header contains much information, but the most important the presence of DNA or RNA/LTR.., these partial words could be in between or at the beginning.
The point is the knowledge of how to extract sequences from a partial header occur in between the ID
My question is how to use grep or awk to grep the header that has one of these words along with sequences?
Note that The sequences are more than one line.
Or perhaps grep the specific word and ignore what before and after?
>Tigger16a#DNA/TcMar-Tigger DF0000028 TcMar-Tigger **DNA** transposon
>rnd-4_family-38#SINE/MIR ( Recon Family Size = 20, Final Multiple Alignmen
>rnd-6_family-31751#LTR/Gypsy ( Recon Family Size = 26, Final Multiple Alignment Size = 22 )
>RNA2558#LTR/ERVL
>NonDNA1#LINE/I-Jockey
>DNA5#DNA/TcMar-Tc1
I have tried to create a list of the Required IDs using grep creating the list and extracting the sequences but for some reasons, the output has more sequences than those specified in the DNAID.txt list.
grep -A1000 -w -f DNAID.txt.fa MyLibrary > DNA_Sequence.fa
Here is a solution with awk
:
This is the example fasta file which I used:
>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>Test 2 RNA
ACGGT
GGTA
CGGA
>DNA - TEST 3
ACG
GATA
AGGT
You can create a file called fasta_search.awk
and insert following code:
#!/usr/bin/awk -f
BEGIN {
hit=0
}
{
if ($0 ~ /^>/)
{
if ($0 ~ search)
{
hit=1;
print $0;
}
else
{
hit=0;
}
}
else if(hit==1)
{
print $0
}
}
Now you can run:
àwk -v search="DNA" -f fasta_search.awk my_fastas.fa
search
is therefore a awk variable which could be any String you are looking for.
As result I get this:
>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>DNA - TEST 3
ACG
GATA
AGGT
To get all your keywords, you could run:
for i in DNA RNA LTR; do awk -v search="$i" -f fasta_search.awk my_fastas.fa > lib_$i.fasta; done
This results in head lib_*
:
==> lib_DNA.fasta <==
>Test DNA 1
ACGTAAGGATATAATAC
ACGTA
AGGATA
>DNA - TEST 3
ACG
GATA
AGGT
==> lib_LTR.fasta <==
==> lib_RNA.fasta <==
>Test 2 RNA
ACGGT
GGTA
CGGA
Answered by Mr_Z on November 24, 2021
You could use Biopython. If its a fasta file, it will be complicated to write fasta output (especially multiline fasta) with grep
or awk
. Simple solution is to use biopython, so that you can even match for any complex patterns in the fasta header.
from Bio import SeqIO
rna_records = []
dna_records = []
for seq in SeqIO.parse("in.fa","fasta"):
if "RNA" in seq.id:
rna_records.append(seq)
elif "DNA" in seq.id:
dna_records.append(seq)
SeqIO.write(rna_records, "RNA_out.fa","fasta")
SeqIO.write(dna_records, "DNA_out.fa","fasta")
This code needs to be optimised if you are dealing with large sequences like human genome fasta as it holds everything in memory.
Answered by geek_y on November 24, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP