TransWikia.com

How to remove sequences from a fasta file using a sequence ID list which contains a space within the id?

Bioinformatics Asked on July 15, 2021

I have a fasta file that contains sequence reads and sequence id file that needed to be removed from the fasta file. I have done this earlier, but since id contains a space my piece of code is not working and I don’t know how to change it since I’m still learning. fasta file looks like this.

>m64041_200717_231916/74/12723_24868 id=30
CCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCA
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/105/20691_65844 id=488
AGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCAAGACTTTAT
>m64041_200717_231916/108/17414_67048 id=4956
TGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG

IDList file is like this.

m64041_200717_231916/74/12723_24868 id=30
m64041_200717_231916/108/17414_67048 id=4956
m64041_200717_231916/105/20691_65844 id=488

I need to remove these sequences from that fasta file. I have a code I used earlier. But it is not working since id contains a space (I think). Previous code is like this.

# use as follows
# ./filter_seq.bash <fastafile> <list_of_ids_you_want_to_remove>
# ./filter_seq.bash all.fasta bacterial_IDlist

grep ">" $1 | sed 's/>//g' | sort | uniq > all_del
sort $2 | uniq > query_del
comm -23 all_del query_del | sort | uniq > filtered_del
perl -ne 'if(/^>(S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' filtered_del $1 > filtered.fasta
rm *_del

I would be grateful if someone can help me.

2 Answers

Install (pip) biopython if u don't have it.

from Bio import SeqIO
with open('/path/ids.txt', 'r') as fin:
    to_keep = set([])
    for line in fin:
        to_keep.add(line.strip().split(' ')[0])
with open('/path/output.fa', 'w') as fout:
    for record in SeqIO.parse('/path/sequences.fa','fasta'):
        if record.id in to_keep:
            SeqIO.write(record, fout, 'fasta')

Answered by Liam McIntyre on July 15, 2021

I tend to do this using the FastaToTbl and TblToFasta scripts I have posted before. FastaToTbl will convert fastq sequences to single-line, tab-separated recors which makes them far easier to parse. Using these two scripts, you can do:

$ FastaToTbl file.fa | grep -vwf idList | TblToFasta 
>m64041_200717_231916/77/1941_50622 id=3115 
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCC
TGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857 
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCAC
ACTTG

The -v option will tell grep to return non-matching lines, and the -f option tells it to read the patterns to search for from a file. In this case, we give it the file with the IDs to skip (idList). The -w ensures we only find "whole word" matches, so that something like m64041_200717_231916/77/1941_50622 id=31151111 will not count as a match for m64041_200717_231916/77/1941_50622 id=31151111.


Another approach would be to just use awk directly:

$ awk '(NR==FNR){s[">"$0]++} (/^>/){ s[$0] ? a=0 : a=1}a' idList file.fa 
>m64041_200717_231916/77/1941_50622 id=3115
TGAGCCGCACCTCCTCAATCTGCAGCAGTTGAGGCCACTACCCTTCTGCTCAATGGTTCCTGCAGACTTTATCATAGTCACTCACACTTGTCCATGAG
>m64041_200717_231916/162/0_6615 id=857
CAGACTTTATCATAGTCACTCACACTTGTCCATGAGGCAGACTTTATCATAGTCACTCACACTTG

Here is the same awk script slightly expanded to make it easier to understand:

 awk '{
        if (NR==FNR){idsToSkip[">"$0]++} 
        else if (/^>/){ 
            if($0 in idsToSkip){
                wantThisId = 0
            }
            else{
                wantThisId = 1
            }
        }
        if(wantThisId == 1){
            print
        }
      }' idList file.fa 

Answered by terdon on July 15, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP