Bioinformatics Asked on July 15, 2021
I have a fasta file that contains sequence reads and sequence id file that needed to be removed from the fasta file. I have done this earlier, but since id contains a space my piece of code is not working and I don’t know how to change it since I’m still learning. fasta file looks like this.
>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCA
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/105/20691_65844 id=488
AGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCAAGACTTTAT
>m64041_200717_231916/108/17414_67048 id=4956
TGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
IDList file is like this.
m64041_200717_231916/74/12723_24868 id=30
m64041_200717_231916/108/17414_67048 id=4956
m64041_200717_231916/105/20691_65844 id=488
I need to remove these sequences from that fasta file. I have a code I used earlier. But it is not working since id contains a space (I think). Previous code is like this.
# use as follows
# ./filter_seq.bash <fastafile> <list_of_ids_you_want_to_remove>
# ./filter_seq.bash all.fasta bacterial_IDlist
grep ">" $1 | sed 's/>//g' | sort | uniq > all_del
sort $2 | uniq > query_del
comm -23 all_del query_del | sort | uniq > filtered_del
perl -ne 'if(/^>(S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' filtered_del $1 > filtered.fasta
rm *_del
I would be grateful if someone can help me.
Install (pip) biopython if u don't have it.
from Bio import SeqIO
with open('/path/ids.txt', 'r') as fin:
to_keep = set([])
for line in fin:
to_keep.add(line.strip().split(' ')[0])
with open('/path/output.fa', 'w') as fout:
for record in SeqIO.parse('/path/sequences.fa','fasta'):
if record.id in to_keep:
SeqIO.write(record, fout, 'fasta')
Answered by Liam McIntyre on July 15, 2021
I tend to do this using the FastaToTbl and TblToFasta scripts I have posted before. FastaToTbl
will convert fastq sequences to single-line, tab-separated recors which makes them far easier to parse. Using these two scripts, you can do:
$ FastaToTbl file.fa | grep -vwf idList | TblToFasta
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCC
TGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCAC
ACTTG
The -v
option will tell grep
to return non-matching lines, and the -f
option tells it to read the patterns to search for from a file. In this case, we give it the file with the IDs to skip (idList
). The -w
ensures we only find "whole word" matches, so that something like m64041_200717_231916/77/1941_50622 id=31151111
will not count as a match for m64041_200717_231916/77/1941_50622 id=31151111
.
Another approach would be to just use awk
directly:
$ awk '(NR==FNR){s[">"$0]++} (/^>/){ s[$0] ? a=0 : a=1}a' idList file.fa
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
Here is the same awk
script slightly expanded to make it easier to understand:
awk '{
if (NR==FNR){idsToSkip[">"$0]++}
else if (/^>/){
if($0 in idsToSkip){
wantThisId = 0
}
else{
wantThisId = 1
}
}
if(wantThisId == 1){
print
}
}' idList file.fa
Answered by terdon on July 15, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP