Metagenomics: Identifying most common sequences

Question

I am working on a project and used the following command:
vsearch --derep_fulllength filtered_merged.fa -sizeout -relabel Uniq -output dereplicated_filtered_merged.fa

and got the following output:
87373926 nt in 203453 seqs, min 310, max 480, avg 352
Sorting 100%
10981 unique sequences, avg cluster 2.0, median 1, max 1287
Writing output file 100%

The output tells me that 10981 unique sequences were identified, but I can't work out how many reads of the most common sequence were present in the input data.
Any suggestions would be greatly appreciated!

Maximilian Press · Accepted Answer

According to the VSEARCH docs, since you have specified --sizeout your abundances have been written into the FASTA headers:

--sizeout
Take into account the abundance annotations present in the input fasta file (search for the pattern ’[>;]size=integer[;]’ in sequence headers). That option is active by default when rereplicating.
Add abundance annotations to the output fasta file (add the pattern ’;size=integer;’ to sequence headers). If --sizein is specified, each unique sequence receives a new abundance value corresponding to its total abundance (sum of the abundances of its occurrences). If --sizein is not specified, input abundances are set to 1, and each unique sequence receives a new abundance value corresponding to its number of occurrences in the input file.
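
To read those abundances back out, something like the following Python sketch may help. It assumes the headers in dereplicated_filtered_merged.fa look like ">Uniq1;size=1287" (with or without a trailing semicolon, which can differ between VSEARCH versions). The function names read_abundances and most_abundant are made up for illustration and are not part of VSEARCH.

```python
import os
import re

# Matches the 'size=integer' annotation that --sizeout adds to FASTA headers.
SIZE_RE = re.compile(r"[>;]size=(\d+)")


def read_abundances(lines):
    """Return (header, size) pairs for each FASTA header with a size annotation."""
    pairs = []
    for line in lines:
        if line.startswith(">"):
            match = SIZE_RE.search(line)
            if match:
                pairs.append((line[1:].strip(), int(match.group(1))))
    return pairs


def most_abundant(lines, n=1):
    """Return the n (header, size) pairs with the largest size, largest first."""
    return sorted(read_abundances(lines), key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    # File name taken from the command in the question; adjust as needed.
    path = "dereplicated_filtered_merged.fa"
    if os.path.exists(path):
        with open(path) as handle:
            for header, size in most_abundant(handle, n=5):
                print(size, header)
```

If the headers follow that pattern, the largest size value should presumably match the "max 1287" reported in the VSEARCH log, i.e. the number of reads of the most common sequence.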
