Bioinformatics Asked on November 27, 2021
I have some sequences shown below
CAGGTAGCC
CCGGTCAGA
AGGGTTTGA
TTGGTGAGG
CAAGTATGA
ACTGTATGC
CTGGTAACC
TATGTACTG
GCTGTGAGA
CAGGTGGGC
TCAGTGAGA
GGGGTGAGT
TGGGTATGT
GAGGTGAGA
CAGGTGGAG
Each line has 9 nucleotides.
Consider it to be 9 columns.I want to calculate the nucleotide frequency of each nucleotide for each of the 9 columns.
For example 1st column will have these bases C,C,A,T,C,A etc
Out put should be something like this
A 0.25 0.34 0.56 0.43 0.00 0.90 0.45 0.34 0.31
C 0.45 0.40 0.90 0.00 0.40 0.90 0.30 0.25 0.2
G 0.00 0.00 0.34 1.00 0.30 0.30 0.35 0.90 0.1
T 0.24 0.56 0.00 0.00 1.00 0.34 0.45 0.35 0.36
Note, I just made up the numbers to show you the output file format
benn’s answer works but is very un-R-like: you wouldn’t normally iteratively assign to items in a matrix: R has powerful vectorised operations which make this manual work unnecessary. Here’s a more idiomatic solution:
sequences = readLines('sequences.txt')
bases_matrix = do.call(rbind, strsplit(sequences, ''))
apply(bases_matrix, 2L, function (col) {
str = DNAString(paste(col, collapse = ''))
letterFrequency(str, letters = 'ACGT', OR = 0L, as.prob = TRUE)
})
This uses the same Bioconductor packages. Since this is such a simple problem, it can also be written without Bioconductor:
bases = strsplit(sequences, '')
# Use a data.frame here so we can use factors in the next step:
# R does not support matrices of factors. Ugh.
bases_by_column = setNames(do.call(rbind.data.frame, bases), seq_along(bases[[1L]]))
# Ensure that every column will be a complete set of ACGT frequencies
bases_by_column = lapply(bases_by_column, factor, c('A', 'C', 'G', 'T'))
sapply(lapply(bases_by_columns, table), prop.table)
Using modern R idioms from the ‘magrittr’ package, I’d write this as a pipeline; this very directly shows the sequence of transformations.
do.call(rbind.data.frame, bases) %>%
setNames(seq_along(bases[[1L]])) %>%
lapply(factor, c('A', 'C', 'G', 'T')) %>%
lapply(table) %>%
sapply(prop.table)
Answered by Konrad Rudolph on November 27, 2021
Here is how you get the distributions by column using the TraMineR
R package.
library(TraMineR)
sts.data <- c(
"CAGGTAGCC",
"CCGGTCAGA",
"AGGGTTTGA",
"TTGGTGAGG",
"CAAGTATGA",
"ACTGTATGC",
"CTGGTAACC",
"TATGTACTG",
"GCTGTGAGA",
"CAGGTGGGC",
"TCAGTGAGA",
"GGGGTGAGT",
"TGGGTATGT",
"GAGGTGAGA",
"CAGGTGGAG"
)
seq <- seqdef(seqdecomp(sts.data,sep=''))
seqstatd(seq)
and here is the outcome of the seqstatd
function
[State frequencies]
[1] [2] [3] [4] [5] [6] [7] [8] [9]
A 0.13 0.40 0.13 0 0 0.400 0.467 0.067 0.40
C 0.40 0.27 0.00 0 0 0.067 0.067 0.133 0.27
G 0.20 0.20 0.67 1 0 0.467 0.200 0.733 0.20
T 0.27 0.13 0.20 0 1 0.067 0.267 0.067 0.13
[Valid states]
[1] [2] [3] [4] [5] [6] [7] [8] [9]
N 15 15 15 15 15 15 15 15 15
[Entropy index]
[1] [2] [3] [4] [5] [6] [7] [8] [9]
H 0.94 0.94 0.62 0 0 0.78 0.87 0.62 0.94
Answered by Gilbert on November 27, 2021
For anyone interested in doing this from any sort of alignment file, I've implemented a position frequency matrix function in AlignBuddy.
input:
15 9
seq_1 CAGGTAGCC
seq_2 CCGGTCAGA
seq_3 AGGGTTTGA
seq_4 TTGGTGAGG
seq_5 CAAGTATGA
seq_6 ACTGTATGC
seq_7 CTGGTAACC
seq_8 TATGTACTG
seq_9 GCTGTGAGA
seq_10 CAGGTGGGC
seq_11 TCAGTGAGA
seq_12 GGGGTGAGT
seq_13 TGGGTATGT
seq_14 GAGGTGAGA
seq_15 CAGGTGGAG
Command:
:$ alignbuddy input.phy --pos_freq_mat
Output:
### Alignment 1 ###
A 0.133 0.400 0.133 0.000 0.000 0.400 0.467 0.067 0.400
C 0.400 0.267 0.000 0.000 0.000 0.067 0.067 0.133 0.267
G 0.200 0.200 0.667 1.000 0.000 0.467 0.200 0.733 0.200
T 0.267 0.133 0.200 0.000 1.000 0.067 0.267 0.067 0.133
Answered by Steve Bond on November 27, 2021
awk
awk '{L=length($1);for(i=1;i<=L;i++) {B=substr($1,i,1);T[i][B]++;}} END{for(BI=0;BI<4;BI++) {B=(BI==0?"A":(BI==1?"C":(BI==2?"G":"T")));printf("%s",B); for(i in T) {tot=0.0;for(B2 in T[i]){tot+=T[i][B2];}printf("t%0.2f",(T[i][B]/tot));} printf("n");}}' input.txt
A 0.13 0.40 0.13 0.00 0.00 0.40 0.47 0.07 0.40
C 0.40 0.27 0.00 0.00 0.00 0.07 0.07 0.13 0.27
G 0.20 0.20 0.67 1.00 0.00 0.47 0.20 0.73 0.20
T 0.27 0.13 0.20 0.00 1.00 0.07 0.27 0.07 0.13
Or, expanded for clarity:
awk '{
L=length($1);
for(i=1;i<=L;i++) {
B=substr($1,i,1);
T[i][B]++;
}
}
END{
for(BI=0;BI<4;BI++) {
B=(BI==0?"A":(BI==1?"C":(BI==2?"G":"T")));
printf("%s",B);
for(i in T) {
tot=0.0;
for(B2 in T[i]){
tot+=T[i][B2];
}
printf("t%0.2f",(T[i][B]/tot));
}
printf("n");
}
}' input.txt
Answered by Pierre on November 27, 2021
Here's a Perl approach:
#!/usr/bin/env perl
use strict;
my %counts;
## Read the input file line by line
while (my $line = <>) {
print;
## remove trailing 'n' characters
chomp $line;
## split the line into an array at every character
my @columns=split(/./, $line);
## iterate over the array from the first to the last position
for my $i (0..$#columns){
## The nucleotide at this position
my $nt = $columns[$i];
## Save the count in this hash of arrays. The keys are the
## nucleotides, and the value at each position is increased
## when that nucleotide is found in that position.
$counts{$nt}[$i]++;
}
}
## Iterate over the counts hash
for my $nt (sort keys(%counts)){
print "$ntt";
## dereference the array stored in the hash
my @countsForThisNt = @{$counts{$nt}};
## iterate over the counts for each position for this nt
for (my $l=0;$l<=$#countsForThisNt; $l++) {#
## If the value for this position isn't defined,
## set it to 0.
$countsForThisNt[$l]||=0;
## Print all the things
printf "%.2ft", $countsForThisNt[$l]/$., $l;
}
print "n";
}
Save the script somewhere in your PATH, make it executable and run:
$ foo.pl file
A 0.13 0.40 0.13 0.00 0.00 0.40 0.47 0.07 0.40
C 0.40 0.27 0.00 0.00 0.00 0.07 0.07 0.13 0.27
G 0.20 0.20 0.67 1.00 0.00 0.47 0.20 0.73 0.20
T 0.27 0.13 0.20 0.00 1.00 0.07 0.27 0.07 0.13
Alternatively, if you're into the whole brevity thing, and enjoy some golfing, here's the same thing as a one-liner:
perl -ne 'chomp;@F=split(//);for$i(0..$#F){$k{$F[$i]}[$i]++}}{for $nt(sort keys(%k)){print"$ntt";for$i(0..$#{$k{$nt}}){$g=$k{$nt}[$i]||0; printf"%.2ft",$g/$.}print"n";}' file
Answered by terdon on November 27, 2021
Here an example in R using Biostrings
and letterFrequency
(as suggested by Devon Ryan).
library(Biostrings)
data <- read.table("DNA.txt", stringsAsFactors = F)
new <- matrix(nrow = 9, ncol = 15)
for(i in 1:9){
for(j in 1:15){
new[i,j] <- substring(data[j,], i, i)
}
}
countTable <- matrix(nrow = 9, ncol = 4)
for(i in 1:9){
columnSeq <- DNAStringSet(paste0(new[i,], collapse = ""))
columnCounts <- letterFrequency(columnSeq, letters = "ACGT", OR = 0)
countTable[i,] <- columnCounts
}
colnames(countTable) <- c("A", "C", "G", "T")
freqTable <- countTable/15
round(t(freqTable), digit = 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
A 0.13 0.40 0.13 0 0 0.40 0.47 0.07 0.40
C 0.40 0.27 0.00 0 0 0.07 0.07 0.13 0.27
G 0.20 0.20 0.67 1 0 0.47 0.20 0.73 0.20
T 0.27 0.13 0.20 0 1 0.07 0.27 0.07 0.13
Answered by benn on November 27, 2021
Here is one example of how to do this with a bit of python. Alternatively one could create strings of each column and using letterFrequency()
from the Biostrings
package.
#Make a list of hashes
hl = []
for i in range(9):
hl.append({'A': 0, 'C': 0, 'G': 0, 'T': 0})
f = open("foo.txt") # CHANGE ME
nLines = 0
for line in f:
for idx, c in enumerate(line.strip()):
hl[idx][c] += 1
nLines += 1
f.close()
nLines = float(nLines)
for char in ['A', 'C', 'G', 'T']:
print("{}t{}".format(char, "t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))
The output of your example is then:
A 0.13 0.40 0.13 0.00 0.00 0.40 0.47 0.07 0.40
C 0.40 0.27 0.00 0.00 0.00 0.07 0.07 0.13 0.27
G 0.20 0.20 0.67 1.00 0.00 0.47 0.20 0.73 0.20
T 0.27 0.13 0.20 0.00 1.00 0.07 0.27 0.07 0.13
Answered by Devon Ryan on November 27, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP