Bioinformatics Asked by andresito on February 4, 2021
I have a fasta file like this:
>Id1
ATCCTT
>Id2
ATTTTCCC
>Id3
TTTCCCCAAAA
>Id4
CCCTTTAAA
I want to delete sequences that have the following IDs.
Id2
Id3
The IDs are in a .txt file, and the text file will be used to match and delete those sequences.
My output should be a fasta
file like this
>Id1
ATCCTT
>Id4
CCCTTTAAA
But I want it with awk
and/or sed
and/or bash
(no python or perl).
How can I do it?
Presume grep is OK if sed, awk are. The -A(n) and -B(n) flags give (n) lines post match. Presuming all your fasta sequence will be one line is a bit dangerous, but it works for your example. First get those IDs to be removed (in rmid.txt), then inverse grep against the initial fasta.
grep -A1 -f rmid.txt fasta.fa > rmfile.fa
grep -v -f rmfile.fa fasta.fa
The real answer is to use a script that defines a delimiter other than newline "n", then parse for IDs, so better to use one of the languages you don't want to use...
Answered by user1141 on February 4, 2021
Using the FastaToTbl
and TblToFasta
scripts I have posted before, you can do:
FastaToTbl file.fa | grep -vwf ids.txt | TblToFasta > file.2.fa
Answered by terdon on February 4, 2021
If you want to learn how to do things with command-line tools, you can linearize the FASTA with awk
, pipe to grep
to filter for items of interest named in patterns.txt
, then pipe to tr
to delinearize:
$ awk '{ if ((NR>1)&&($0~/^>/)) { printf("n%s", $0); } else if (NR==1) { printf("%s", $0); } else { printf("t%s", $0); } }' in.fa | grep -Ff patterns.txt - | tr "t" "n" > out.fa
This will work on multiline FASTA.
Answered by Alex Reynolds on February 4, 2021
Suppose you keep sequence names in ids.txt
and sequences in seq.fa
:
awk 'BEGIN{while((getline<"ids.txt")>0)l[">"$1]=1}/^>/{f=!l[$1]}f' seq.fa
Answered by user172818 on February 4, 2021
Just today I wrote a script to do exactly this using Biopython
. I also add a warning If any the headers I wanted to filter was not present in the original fasta. So the python script filter_fasta_by_list_of_headers.py
is :
#!/usr/bin/env python3
from Bio import SeqIO
import sys
ffile = SeqIO.parse(sys.argv[1], "fasta")
header_set = set(line.strip() for line in open(sys.argv[2]))
for seq_record in ffile:
try:
header_set.remove(seq_record.name)
except KeyError:
print(seq_record.format("fasta"))
continue
if len(header_set) != 0:
print(len(header_set),'of the headers from list were not identified in the input fasta file.', file=sys.stderr)
and usage :
filter_fasta_by_list_of_headers.py input.fasta list_of_scf_to_filter > filtered.fasta
P.S. it's quite easy to turn over the script to extract the sequences from the list (just the print line would have to move after line header_set.remove(seq_record.name
)
Answered by Kamil S Jaron on February 4, 2021
This is what samtools faidx
is intended for.
It needs to be called twice.
First, to create a FASTA index (*.fai
file):
samtools faidx input.fasta
Secondly, to filter the FASTA file with the help of the index:
samtools faidx -o output.fasta input.fasta ids…
ids…
are the IDs to retain, as individual arguments separated by space. If you’ve got a file blocklist.txt
with IDs you want to discard (one per line), you first need to invert this, after having created the index (using Bash syntax):1
remove_ids=($(awk '{print $1}' input.fasta.fai | grep -v -f blocklist.txt))
… and then we can invoke the command as
samtools faidx -o output.fasta input.fasta "${remove_ids[@]}"
What’s nice about samtools faidx
is that it also works with BGZF-compressed FASTA data — i.e. fa.gz
files (but only if they were compressed using bgzip
, which ships with Samtools, and the index must be generated on the compressed file). However, it will always write FASTA. If you want gzipped FASTA output, you need to manually stream the result into bgzip
:
samtools faidx input.fasta.gz "${remove_ids[@]}" | bgzip -c >output.fasta.gz
1 Careful, this syntax is generally unsafe for reading a command output into a variable; we can use it here only because we assume that we want to generate a space-separated list of identifiers; the generally safe way is a lot more convoluted.
Answered by Konrad Rudolph on February 4, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP