Bioinformatics Asked on August 27, 2021
I have a set of 4 amino acids which i have aligned,but i want to compare then with respective nucleotide sequence in terms of change codon level.
I have the alignment file where i want to take only for the C-terminal domain region region only the CTD starts with amino acids “NITNLC”
till it ends with “HAPATV”
and rest of the sequence i dont want to take.
In short i want to take the part sequence which starts from "NITNLC" to "HAPATV" then i would like to compare what are the changes in the nucleotide sequence at codon level.
To start with I have used the seqinr library to generate nucleotide alignment from aligned protein sequence.
reverse.align {seqinr}
Then the next step that i want to compare each of my amino acid sequence to my reference sequence which in my case is P0DTC2
to rest of the sequence for example I want to compare P0DTC2
to K9N5Q8
to the above mentioned CTD region which starts from NITNLC
and ends with HAPATV
and find what are the codon level changes if there is any I would like to report both in amino acid level and codon level.
First part I guess my approach is correct. Second part I’m not sure how to proceed i guess it’s much more than simple parsing I guess!!.
Any help or suggestion would be really appreciated ,and if there exist a R based solution that would be really welcome.
Files
Amino acid file nucleotide file Aligned amino acid file Reverse alignment nucleotide file using seqinr
To set the ball rolling, I tried to implement it manually with your files. Overall, there are three steps:
NITNLC
(or HAPATV
) in protein P0DTC2
.P0DTC2
and K9N5Q8
within the range from step 1.It works, but only for the first 60 amino acids. I wonder if it's due to the amino sequences in AMINOOO_seq_removed.fasta
repeat itself every 60 acids. But why?
#For example, the first three lines of protein P0DTC2
>P0DTC2
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
...
Step 0: Read the files.
library('seqinr')
#align file containing protein sequences
count_added <- read.alignment('count_added_.clustal_num', format='clustal')
names(count_added$seq) <- count_added$nam
#DNA sequences
rev3.aln <- read.alignment('rev3.aln', format='fasta')
names(rev3.aln$seq) <- rev3.aln$nam
Step 1: Locate NITNLC
(or HAPATV
) in protein P0DTC2
. Again, there are two repeats of NITNLC
60 acids apart (at 807 and 870).
locate <- function(seq, find)
{address <- gregexpr(paste(strsplit(find, '')[[1]], collapse='[^a-z]*'), seq)
#substr(seq, address[[1]][1], address[[1]][1]+attr(address[[1]], 'match.length')[1]-1)
return(list(start=as.numeric(address[[1]]),
end=as.numeric(address[[1]] + attr(address[[1]], 'match.length') - 1)))
}
locate(seq=count_added$seq[['P0DTC2']], find='nitnlc')
#start
#807 870
#end
#812 875
locate(seq=count_added$seq[['P0DTC2']], find='hapatv')
#start
#1212 1279
#end
#1217 1285
Step 2: Locate mismatched amino acids between proteins P0DTC2
and K9N5Q8
within the range from step 1. Instead of using the range 812 - 1279, I chose range 1 - 20 for demonstration purposes.
compare <- function(seq1, seq2, after=0, before=100000)
{seq1_ = strsplit(seq1, '')[[1]]
seq2_ = strsplit(seq2, '')[[1]]
ind = which(seq1_ != seq2_ & grepl('[a-z]',seq1_) & grepl('[a-z]',seq2_))
ind = ind[ind>after & ind<before]
#seq1_[ind[1]]
#seq2_[ind[1]]
return(ind)
}
compare(seq1=count_added$seq[['P0DTC2']], seq2=count_added$seq[['K9N5Q8']], after=1, before=20)
# [1] 5 7 8 9 10 13 14
#protein comparison
#K9N5Q8 MIHSVFLLMFLLTPTESYVD
#P0DTC2 ----MFVFLVLLPL------
#<mismatch> 5 7890 34
Step 3: Print the amino acid and the DNA codon of both proteins from step 2. Note that ind
is based on the align file.
print_amino_codon <- function(ind, seq, seq_gene)
{locate_amino <- gregexpr('[a-z]', seq)[[1]]
if (!ind %in% locate_amino) return(NA)
ind2 = match(ind, locate_amino)
return(c(amino=substr(seq, ind, ind), codon=substr(seq_gene, ind2*3-2, ind2*3)))
}
codon(ind=5, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon
# "v" "atg"
codon(ind=6, vseq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon
# "f" "ttc"
codon(ind=7, seq=count_added$seq[['K9N5Q8']], seq_gene=rev3.aln$seq[['K9N5Q8']])
#amino codon
# "l" "ttg"
codon(ind=5, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon
# "m" "atg"
codon(ind=6, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon
# "f" "ttt"
codon(ind=7, seq=count_added$seq[['P0DTC2']], seq_gene=rev3.aln$seq[['P0DTC2']])
#amino codon
# "v" "gtt"
#protein comparison
#K9N5Q8 MIHSVFLLMFLLTPTESYVD
#P0DTC2 ----MFVFLVLLPL------
#<mismatch> 5 7890 34
#<print> ^^^
#K9N5Q8 gene codon
#gtg ttt cta ctg atg ttc ttg tta aca
# ^^^ ^^^ ^^^
#P0DTC2 gene codon
#atg ttt gtt ttt ctt
#^^^ ^^^ ^^^
Correct answer by Ryan SY Kwan on August 27, 2021
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP