Which sequence alignment tools support codon alignment?

Question

Sometimes it is useful to align protein-coding nucleotide sequences by codon rather than by individual nucleotide. For example, further codon-model analysis requires full codons.

A widely used approach is to align the protein sequences first and then impose that alignment on the nucleotide sequences using PAL2NAL, CodonAlign or something similar.

This is how transAlign or GUIDANCE (in codon mode) work.
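
The protein-guided approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the actual implementation of PAL2NAL, transAlign or GUIDANCE; the function name `thread_codons` is hypothetical. It assumes the coding sequence is in frame, has a length divisible by three, and corresponds residue for residue to the aligned protein (an optional trailing stop codon is dropped). It does not verify the translation.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Impose a protein alignment row on its coding sequence.

    Each residue is replaced by its codon and each gap by three gap
    characters, so the result is aligned codon by codon.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")

    # Split the coding sequence into consecutive codons.
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = len(aligned_protein.replace(gap, ""))

    # Assumption: a single extra trailing codon is tolerated only if it is a stop.
    if len(codons) == n_residues + 1 and codons[-1].upper() in STOP_CODONS:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("protein and coding sequence lengths disagree")

    codon_iter = iter(codons)
    return "".join(gap * 3 if ch == gap else next(codon_iter)
                   for ch in aligned_protein)


# Two rows of a toy protein alignment and their coding sequences.
print(thread_codons("MK-L", "ATGAAACTG"))      # ATGAAA---CTG
print(thread_codons("MKAL", "ATGAAGGCTCTT"))   # ATGAAGGCTCTT
```

Because the gap positions come only from the protein alignment, any nucleotide-level signal is ignored, which is exactly the limitation raised in the next paragraph.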

The problem is that this discards part of the information that could be used for the alignment. For example, if a slowly evolving low-complexity region is adjacent to a quickly evolving one, the amino-acid-induced alignment could be wrong, whereas incorporating the nucleotide sequence could make the alignment more accurate.

I'm aware of two programs that can do true codon alignment. First, PRANK has a dedicated codon model, but it is rather slow, and using it is overkill for certain problems. Second, Sequence Manipulation Suite can perform codon alignments, but only for a pair of sequences; it is also JavaScript-based, so it is hard to run on a large number of sequences.

Can you recommend any software for multiple codon sequence alignment? Preferably available for offline use.

user90 · Answer

Try MACSE v2 (https://academic.oup.com/mbe/article/35/10/2582/5079334). It aligns multiple protein-coding nucleotide sequences based on their amino acid translation while allowing for the occurrence of frameshifts.

Answered by user90 on December 9, 2021

kristof theys · Answer

Virulign does this for virus sequences; the publication is available here and the GitHub address is here.

gringer · Answer

I don't know of any transcript-to-transcript aligners that are able to do this, but LAST can align transcript queries to protein reference sequences using a specified frameshift cost. Here's the specific documentation for that option:

-F COST   
  
  Align DNA queries to protein reference sequences, using the specified
  frameshift cost. A value of 15 seems to be reasonable. (As a special
  case, -F0 means DNA-versus-protein alignment without frameshifts,
  which is faster.) The output looks like this:

a score=108
s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
s dna  8 117 + 999 FFLQ-IKLWDP\STPH*IVSSP/PSDLISAHTLCPRMKSQDN

The \ indicates a forward shift by one nucleotide, and the / indicates
  a reverse shift by one nucleotide. The * indicates a stop codon. The
  same alignment in tabular format looks like this:

108 prot 2 40 + 649 dna 8 117 + 999 4,1:0,6,0:1,10,0:-1,19

The "-1" indicates the reverse frameshift.
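
For anyone post-processing this output, here is a rough Python sketch that pulls frameshifts out of the last tabular column. It is not part of LAST; the function name `find_frameshifts` is hypothetical. It assumes that plain numbers are gapless block sizes counted in protein residues, that `upper:lower` entries are offsets in the protein and DNA sequences respectively, and that a DNA offset not divisible by 3 marks a frameshift. Only the meaning of "-1" is stated in the documentation quoted above; the rest is inferred from this one example.

```python
def find_frameshifts(blocks: str) -> list[tuple[int, int]]:
    """Return (protein_offset, shift) pairs from a LAST tabular blocks column.

    protein_offset counts residues from the start of the alignment;
    shift is the DNA offset (assumed to be in nucleotides).
    """
    shifts = []
    protein_offset = 0
    for field in blocks.split(","):
        if ":" in field:
            upper, lower = (int(x) for x in field.split(":"))
            protein_offset += upper
            # Assumption: a DNA offset that is not a multiple of 3 breaks the frame.
            if lower % 3 != 0:
                shifts.append((protein_offset, lower))
        else:
            # Gapless block: advance by its size in residues.
            protein_offset += int(field)
    return shifts


print(find_frameshifts("4,1:0,6,0:1,10,0:-1,19"))
# [(11, 1), (21, -1)]
```

The two reported positions match the \ and / symbols in the alignment above: the forward shift follows the first 11 protein residues and the reverse shift follows the first 21.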

I sent an email to the LAST mailing list about adding a frameshift penalty for transcript-to-transcript matching; I've been pleasantly surprised with the requested features that Martin Frith has added to LAST in the past. Unfortunately, in this case the problem is too difficult to sort out due to all the possible combinations that could happen, so it's unlikely to be implemented in LAST in the foreseeable future (unless someone else writes that code).
