Bioinformatics Asked on December 9, 2021
Sometimes it is useful to align protein-coding nucleotide sequences codon by codon rather than nucleotide by nucleotide. For example, downstream codon-model analyses need intact codons.
A widely used approach is to align the protein sequences first and then impose that alignment onto the nucleotide sequences using PAL2NAL, CodonAlign or something similar.
This is how transAlign or GUIDANCE (in codon mode) work.
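To make the "impose the protein alignment onto the nucleotides" step concrete, here is a minimal Python sketch. It is not the code of PAL2NAL, CodonAlign, transAlign or GUIDANCE; the function name `thread_codons` is made up for illustration, and it assumes each coding sequence is in frame, has no frameshifts, and has had its terminal stop codon removed so that it contains exactly one codon per residue.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Impose one aligned protein row onto its unaligned coding sequence.

    Hypothetical helper: every residue becomes its codon and every
    protein gap becomes a three-column nucleotide gap.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues != len(codons):
        raise ValueError("protein row and CDS disagree in length")
    codon_iter = iter(codons)
    return "".join(gap * 3 if ch == gap else next(codon_iter)
                   for ch in aligned_protein)


# Toy input: a protein alignment and the matching unaligned CDS.
protein_alignment = {"seq1": "MK-L", "seq2": "MKTL"}
cds = {"seq1": "ATGAAACTG", "seq2": "ATGAAGACCCTG"}

codon_alignment = {name: thread_codons(row, cds[name])
                   for name, row in protein_alignment.items()}
for name, row in codon_alignment.items():
    print(name, row)
```

Note that the nucleotides never influence where the gaps go in this scheme; they are only copied into columns chosen from the amino acids.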
The problem is that this discards part of the information that could be used for the alignment. For example, if a slowly evolving low-complexity region is adjacent to a quickly evolving one, the amino-acid-induced alignment could be wrong, whereas incorporating the nucleotide sequences could make the alignment more accurate.
I’m aware of two programs that can do true codon alignment. First, PRANK has a dedicated codon model, but it is rather slow, and using it is overkill for certain problems. Second, the Sequence Manipulation Suite can perform codon alignments, but only for a pair of sequences; it is also JavaScript-based, so it is hard to run on a large number of sequences.
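To illustrate what I mean by "true" codon alignment, here is a rough Python sketch of a pairwise global (Needleman-Wunsch) alignment whose units are whole codons and whose score uses both the encoded amino acid and the nucleotide identities. It is not how PRANK or the Sequence Manipulation Suite work; the function names and all scoring weights are arbitrary assumptions, and it only handles in-frame sequences made of A, C, G and T under the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_score(a, b, aa_match=2.0, aa_mismatch=-1.0, nt_match=1.0):
    """Toy score: amino acid agreement plus one bonus per identical base."""
    aa_part = aa_match if CODON_TABLE[a] == CODON_TABLE[b] else aa_mismatch
    return aa_part + nt_match * sum(x == y for x, y in zip(a, b))


def align_codons(seq1, seq2, gap=-4.0):
    """Globally align two in-frame CDS, gapping only whole codons."""
    if len(seq1) % 3 or len(seq2) % 3:
        raise ValueError("both sequences must have a length divisible by 3")
    a = [seq1[i:i + 3] for i in range(0, len(seq1), 3)]
    b = [seq2[i:i + 3] for i in range(0, len(seq2), 3)]
    n, m = len(a), len(b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + codon_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,   # codon of seq1 against a gap
                score[i][j - 1] + gap,   # codon of seq2 against a gap
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    row1, row2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + codon_score(a[i - 1], b[j - 1])):
            row1.append(a[i - 1]); row2.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row1.append(a[i - 1]); row2.append("---"); i -= 1
        else:
            row1.append("---"); row2.append(b[j - 1]); j -= 1
    return "".join(reversed(row1)), "".join(reversed(row2))


top, bottom = align_codons("ATGAAACTG", "ATGCTG")
print(top)
print(bottom)
```

The point of the sketch is only that synonymous codons can still be ranked by nucleotide similarity, which a protein-first alignment cannot do. It is quadratic in sequence length and limited to two sequences, which is exactly the limitation I would like to get around.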
Can you recommend any software for multiple codon sequence alignment? Preferably available for offline use.
Try MACSE v2 (https://academic.oup.com/mbe/article/35/10/2582/5079334). It aligns multiple protein-coding nucleotide sequences based on their amino acid translation while allowing for the occurrence of frameshifts.
Answered by user90 on December 9, 2021
Answered by kristof theys on December 9, 2021
I don't know of any transcript-to-transcript aligners that are able to do this, but LAST can align transcript queries to protein reference sequences using a specified frameshift cost. Here's the specific documentation for that option:
-F COST
Align DNA queries to protein reference sequences, using the specified frameshift cost. A value of 15 seems to be reasonable. (As a special case, -F0 means DNA-versus-protein alignment without frameshifts, which is faster.) The output looks like this:
a score=108
s prot 2  40 + 649 FLLQAVKLQDP-STPHQIVPSP-VSDLIATHTLCPRMKYQDD
s dna  8 117 + 999 FFLQ-IKLWDP\STPH*IVSSP/PSDLISAHTLCPRMKSQDN
The \ indicates a forward shift by one nucleotide, and the / indicates a reverse shift by one nucleotide. The * indicates a stop codon. The same alignment in tabular format looks like this:
108 prot 2 40 + 649 dna 8 117 + 999 4,1:0,6,0:1,10,0:-1,19
The "-1" indicates the reverse frameshift.
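If you want to pick the frameshifts out of that last column programmatically, something like the following Python sketch should work. The function names are my own, not part of LAST. It assumes that a plain number is a gapless run of residue-versus-codon columns, and that in an x:y item x counts protein residues and y counts DNA nucleotides; those assumptions reproduce the spans of 40 residues and 117 nucleotides reported for the example above.

```python
def parse_blocks(blocks):
    """Turn a blocks string into (kind, protein_len, dna_len) tuples."""
    items = []
    for token in blocks.split(","):
        if ":" in token:
            prot_len, dna_len = (int(x) for x in token.split(":"))
            items.append(("gap", prot_len, dna_len))
        else:
            n = int(token)
            # Assumption: one aligned column = 1 residue = 3 nucleotides.
            items.append(("match", n, 3 * n))
    return items


def spans(items):
    """Total protein residues and DNA nucleotides covered by the alignment."""
    return (sum(p for _, p, _ in items), sum(d for _, _, d in items))


def frameshifts(items):
    """DNA-side gap sizes that are not a multiple of 3, i.e. frame changes."""
    return [d for kind, _, d in items if kind == "gap" and d % 3 != 0]


items = parse_blocks("4,1:0,6,0:1,10,0:-1,19")
print(spans(items))        # (40, 117), matching the prot and dna span columns
print(frameshifts(items))  # [1, -1]: one forward and one reverse shift
```

Checking the computed spans against the span columns of each line is a cheap sanity test that the blocks field is being interpreted the way you think it is.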
I sent an email to the LAST mailing list about adding a frameshift penalty for transcript-to-transcript matching; I've been pleasantly surprised with the requested features that Martin Frith has added to LAST in the past. Unfortunately, in this case the problem is too difficult to sort out due to all the possible combinations that could happen, so it's unlikely to be implemented in LAST in the foreseeable future (unless someone else writes that code).
Answered by gringer on December 9, 2021