Comparing aligned amino acids to codon

Question

I have a set of 4 amino acids which i have aligned,but i want to compare then with respective  nucleotide sequence in terms of change codon level.
I have the alignment file where i want to take only for the C-terminal domain region region only the CTD starts with amino acids “NITNLC” till it ends with “HAPATV” and rest of the sequence i dont want to take.
In short i want to take the part sequence which starts from "NITNLC" to "HAPATV" then i would like to compare what are the changes in the nucleotide sequence at codon level.

To start with I have used the seqinr library to generate nucleotide alignment from aligned protein sequence.
reverse.align {seqinr}

Then the next step that i want to compare each of my amino acid sequence to my reference sequence which in my case is P0DTC2to rest of the sequence for example I want to compare P0DTC2 to K9N5Q8to the above mentioned CTD region which starts from NITNLC and ends with HAPATVand find what are the codon level changes if there is any I would like to report both in amino acid level and codon level.

First part I guess my approach is correct. Second part I'm not sure how to proceed i guess it's much more than simple parsing I guess!!.
Any help or suggestion would be really appreciated ,and if there exist a R based solution that would be really welcome.
Files
Amino acid file nucleotide file Aligned amino acid file Reverse alignment nucleotide file using seqinr

Ryan SY Kwan · Accepted Answer

To set the ball rolling, I tried to implement it manually with your files. Overall, there are three steps:

Locate NITNLC (or HAPATV) in protein P0DTC2.
Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1.
Print the amino acid and the DNA codon of both proteins from step 2.

It works, but only for the first 60 amino acids. I wonder if it's due to the amino sequences in AMINOOO_seq_removed.fasta repeat itself every 60 acids. But why?

#For example, the first three lines of protein P0DTC2
>P0DTC2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
...

Step 0: Read the files.

library('seqinr')

#align file containing protein sequences
count_added <- read.alignment('count_added_.clustal_num', format='clustal')
names(count_added$seq) <- count_added$nam

#DNA sequences
rev3.aln <- read.alignment('rev3.aln', format='fasta')
names(rev3.aln$seq) <- rev3.aln$nam

Step 1: Locate NITNLC (or HAPATV) in protein P0DTC2. Again, there are two repeats of NITNLC 60 acids apart (at 807 and 870).

locate <- function(seq, find)
  {address <- gregexpr(paste(strsplit(find, '')[[1]], collapse='[^a-z]*'), seq)
  #substr(seq, address[[1]][1], address[[1]][1]+attr(address[[1]], 'match.length')[1]-1)

  return(list(start=as.numeric(address[[1]]), 
             end=as.numeric(address[[1]] + attr(address[[1]], 'match.length') - 1)))
  }

locate(seq=count_added$seq[['P0DTC2']], find='nitnlc')
#start
#807 870
#end
#812 875
locate(seq=count_added$seq[['P0DTC2']], find='hapatv')
#start
#1212 1279
#end
#1217 1285

Step 2: Locate mismatched amino acids between proteins P0DTC2 and K9N5Q8 within the range from step 1. Instead of using the range 812 - 1279, I chose range 1 - 20 for demonstration purposes.

compare <- function(seq1, seq2, after=0, before=100000)
  {seq1_ = strsplit(seq1, '')[[1]]
  seq2_ = strsplit(seq2, '')[[1]]
  ind = which(seq1_ != seq2_ & grepl('[a-z]',seq1_) & grepl('[a-z]',seq2_))
  ind = ind[ind>after & ind<before]
  #seq1_[ind[1]]
  #seq2_[ind[1]]

  return(ind)
  }
compare(seq1=count_added$seq[['P0DTC2']], seq2=count_added$seq[['K9N5Q8']], after=1, before=20)
# [1] 5  7  8  9 10 13 14

#protein comparison
#K9N5Q8      MIHSVFLLMFLLTPTESYVD
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34

Step 3: Print the amino acid and the DNA codon of both proteins from step 2. Note that ind is based on the align file.

print_amino_codon <- function(ind, seq, seq_gene)
  {locate_amino <- gregexpr('[a-z]', seq)[[1]]
  if (!ind %in% locate_amino) return(NA)
  ind2 = match(ind, locate_amino)

  return(c(amino=substr(seq, ind, ind), codon=substr(seq_gene, ind2*3-2, ind2*3)))
  }
 
codon(ind=5, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "v" "atg" 
codon(ind=6, vseq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "f" "ttc" 
codon(ind=7, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon 
#  "l" "ttg"

codon(ind=5, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "m" "atg" 
codon(ind=6, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "f" "ttt" 
codon(ind=7, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon 
#  "v" "gtt" 

#protein comparison
#K9N5Q8      MIHSVFLLMFLLTPTESYVD
#P0DTC2      ----MFVFLVLLPL------
#<mismatch>      5 7890  34
#<print>         ^^^

#K9N5Q8 gene codon
#gtg ttt cta ctg atg ttc ttg tta aca
#                ^^^ ^^^ ^^^
#P0DTC2 gene codon
#atg ttt gtt ttt ctt
#^^^ ^^^ ^^^

Comparing aligned amino acids to codon

One Answer

Add your own answers!

Ask a Question