TransWikia.com

How to get matching pattern along with ID in a single command in grep?

Bioinformatics Asked by 20 21 on June 7, 2021

I have a file containing multiple FASTA sequences. Given a specific consensus pattern as input, I extracted all the matching subsequences from the target FASTA sequences with

grep -o -E "CC[GT]AAA[GC][AC]TT[GC]" input.fasta

However, this command prints only the matching subsequences. I also want the FASTA header that each match belongs to.

For example, if input.fasta file is something like this,

>Gene 1
TGATGAAAAATGATAGAT
ATTGGGGGAAAAAAAAAT

>Gene 2
TTTCCTAAAGATTGT
AAATTTAAAAATGTTTTT

(Gene 2 has matching subsequence CCTAAAGATTG)

Output:

CCTAAAGATTG   Gene2

I would prefer a solution with grep, but other solutions are also welcome.

One Answer

Use this Perl one-liner:

perl -lne '$id = $1 if /^>(.+)/; ($m) = /(CC[GT]AAA[GC][AC]TT[GC])/; print join "\t", $id, $m if $m;' input.fasta

The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.

SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches

Correct answer by Timur Shtatland on June 7, 2021
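
Since the question asks for grep, here is a rough sketch of a grep-based alternative, not part of the accepted answer. The idea is to have grep -o print both the header lines and the pattern matches, then use awk to attach the most recent header to each match. The function name fasta_matches is made up for illustration. It assumes every match lies within a single line of the file, so a match that spans a line break in a wrapped sequence would be missed.

```shell
# fasta_matches FILE: print "match<TAB>header" for every pattern match in FILE.
# The function name is hypothetical; the pattern is the one from the question.
# grep -o prints each header line (first alternative) or each motif match
# (second alternative) on its own line, in file order.
# awk remembers the most recent header and pairs it with each following match.
fasta_matches() {
  grep -o -E '^>.*|CC[GT]AAA[GC][AC]TT[GC]' "$1" |
    awk '/^>/ { id = substr($0, 2); next } { print $0 "\t" id }'
}
```

Called as fasta_matches input.fasta, this prints the match first and the header second, separated by a tab, and it reports every match on a line rather than only the first one.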
