calculating nucleotide frequency per column

Question

I have some sequences shown below
CAGGTAGCC
CCGGTCAGA
AGGGTTTGA
TTGGTGAGG
CAAGTATGA
ACTGTATGC
CTGGTAACC
TATGTACTG
GCTGTGAGA
CAGGTGGGC
TCAGTGAGA
GGGGTGAGT
TGGGTATGT
GAGGTGAGA
CAGGTGGAG

Each line has 9 nucleotides.
Consider it to be 9 columns.I want to calculate the nucleotide frequency of each nucleotide for each of the 9 columns.
For example 1st column will have these bases C,C,A,T,C,A etc
Out put should be something like this
A   0.25    0.34    0.56    0.43    0.00    0.90    0.45    0.34    0.31
C   0.45    0.40    0.90    0.00    0.40    0.90    0.30    0.25    0.2
G   0.00    0.00    0.34    1.00    0.30    0.30    0.35    0.90    0.1
T   0.24    0.56    0.00    0.00    1.00    0.34    0.45    0.35    0.36

Note, I just made up the numbers to show you the output file format

Konrad Rudolph · Answer

benn’s answer works but is very un-R-like: you wouldn’t normally iteratively assign to items in a matrix: R has powerful vectorised operations which make this manual work unnecessary. Here’s a more idiomatic solution:
sequences = readLines('sequences.txt')
bases_matrix = do.call(rbind, strsplit(sequences, ''))

apply(bases_matrix, 2L, function (col) {
    str = DNAString(paste(col, collapse = ''))
    letterFrequency(str, letters = 'ACGT', OR = 0L, as.prob = TRUE)
})

This uses the same Bioconductor packages. Since this is such a simple problem, it can also be written without Bioconductor:
bases = strsplit(sequences, '')
# Use a data.frame here so we can use factors in the next step:
# R does not support matrices of factors. Ugh.
bases_by_column = setNames(do.call(rbind.data.frame, bases), seq_along(bases[[1L]]))
# Ensure that every column will be a complete set of ACGT frequencies
bases_by_column = lapply(bases_by_column, factor, c('A', 'C', 'G', 'T'))
sapply(lapply(bases_by_columns, table), prop.table)

Using modern R idioms from the ‘magrittr’ package, I’d write this as a pipeline; this very directly shows the sequence of transformations.
do.call(rbind.data.frame, bases) %>%
    setNames(seq_along(bases[[1L]])) %>%
    lapply(factor, c('A', 'C', 'G', 'T')) %>%
    lapply(table) %>%
    sapply(prop.table)

Gilbert · Answer

Here is how you get the distributions by column using the TraMineR R package.

library(TraMineR)

sts.data <- c(
  "CAGGTAGCC",
  "CCGGTCAGA",
  "AGGGTTTGA",
  "TTGGTGAGG",
  "CAAGTATGA",
  "ACTGTATGC",
  "CTGGTAACC",
  "TATGTACTG",
  "GCTGTGAGA",
  "CAGGTGGGC",
  "TCAGTGAGA",
  "GGGGTGAGT",
  "TGGGTATGT",
  "GAGGTGAGA",
  "CAGGTGGAG"
)
seq <- seqdef(seqdecomp(sts.data,sep=''))

seqstatd(seq)

and here is the outcome of the seqstatd function

[State frequencies]
   [1]  [2]  [3] [4] [5]   [6]   [7]   [8]  [9]
A 0.13 0.40 0.13   0   0 0.400 0.467 0.067 0.40
C 0.40 0.27 0.00   0   0 0.067 0.067 0.133 0.27
G 0.20 0.20 0.67   1   0 0.467 0.200 0.733 0.20
T 0.27 0.13 0.20   0   1 0.067 0.267 0.067 0.13

[Valid states]
  [1] [2] [3] [4] [5] [6] [7] [8] [9]
N  15  15  15  15  15  15  15  15  15

[Entropy index]
   [1]  [2]  [3] [4] [5]  [6]  [7]  [8]  [9]
H 0.94 0.94 0.62   0   0 0.78 0.87 0.62 0.94

Steve Bond · Answer

For anyone interested in doing this from any sort of alignment file, I've implemented a position frequency matrix function in AlignBuddy.

input:

15 9
seq_1      CAGGTAGCC
seq_2      CCGGTCAGA
seq_3      AGGGTTTGA
seq_4      TTGGTGAGG
seq_5      CAAGTATGA
seq_6      ACTGTATGC
seq_7      CTGGTAACC
seq_8      TATGTACTG
seq_9      GCTGTGAGA
seq_10     CAGGTGGGC
seq_11     TCAGTGAGA
seq_12     GGGGTGAGT
seq_13     TGGGTATGT
seq_14     GAGGTGAGA
seq_15     CAGGTGGAG

Command:

:$ alignbuddy input.phy --pos_freq_mat

Output:

### Alignment 1 ###
A       0.133   0.400   0.133   0.000   0.000   0.400   0.467   0.067   0.400
C       0.400   0.267   0.000   0.000   0.000   0.067   0.067   0.133   0.267
G       0.200   0.200   0.667   1.000   0.000   0.467   0.200   0.733   0.200
T       0.267   0.133   0.200   0.000   1.000   0.067   0.267   0.067   0.133

Pierre · Answer

awk

awk '{L=length($1);for(i=1;i<=L;i++) {B=substr($1,i,1);T[i][B]++;}} END{for(BI=0;BI<4;BI++) {B=(BI==0?"A":(BI==1?"C":(BI==2?"G":"T")));printf("%s",B); for(i in T) {tot=0.0;for(B2 in T[i]){tot+=T[i][B2];}printf("t%0.2f",(T[i][B]/tot));} printf("n");}}' input.txt

A   0.13    0.40    0.13    0.00    0.00    0.40    0.47    0.07    0.40
C   0.40    0.27    0.00    0.00    0.00    0.07    0.07    0.13    0.27
G   0.20    0.20    0.67    1.00    0.00    0.47    0.20    0.73    0.20
T   0.27    0.13    0.20    0.00    1.00    0.07    0.27    0.07    0.13

Or, expanded for clarity:

awk '{
        L=length($1);
        for(i=1;i<=L;i++) {
            B=substr($1,i,1);
            T[i][B]++;
        }
      } 
      END{
            for(BI=0;BI<4;BI++) {
                B=(BI==0?"A":(BI==1?"C":(BI==2?"G":"T")));
                printf("%s",B); 
                for(i in T) {
                    tot=0.0;
                    for(B2 in T[i]){
                        tot+=T[i][B2];
                    }
                printf("t%0.2f",(T[i][B]/tot));
                } 
            printf("n");
            }
}' input.txt

terdon · Answer

Here's a Perl approach:

#!/usr/bin/env perl
use strict;
my %counts;

## Read the input file line by line
while (my $line = <>) {
  print;
  ## remove trailing 'n' characters
  chomp $line;
  ## split the line into an array at every character
  my @columns=split(/./, $line); 
  ## iterate over the array from the first to the last position
  for my $i (0..$#columns){
    ## The nucleotide at this position
    my $nt = $columns[$i];    
    ## Save the count in this hash of arrays. The keys are the
    ## nucleotides, and the value at each position is increased
    ## when that nucleotide is found in that position.
    $counts{$nt}[$i]++;
  }
}
## Iterate over the counts hash
for my $nt (sort keys(%counts)){
  print "$ntt";
  ## dereference the array stored in the hash
  my @countsForThisNt = @{$counts{$nt}};
  ## iterate over the counts for each position for this nt
  for (my $l=0;$l<=$#countsForThisNt; $l++) {#
    ## If the value for this position isn't defined,
    ## set it to 0.
    $countsForThisNt[$l]||=0;
    ## Print all the things
    printf "%.2ft", $countsForThisNt[$l]/$., $l;
  }
  print "n";
}

Save the script somewhere in your PATH, make it executable and run:

$ foo.pl file  
A   0.13    0.40    0.13    0.00    0.00    0.40    0.47    0.07    0.40    
C   0.40    0.27    0.00    0.00    0.00    0.07    0.07    0.13    0.27    
G   0.20    0.20    0.67    1.00    0.00    0.47    0.20    0.73    0.20    
T   0.27    0.13    0.20    0.00    1.00    0.07    0.27    0.07    0.13

Alternatively, if you're into the whole brevity thing, and enjoy some golfing, here's the same thing as a one-liner:

perl -ne 'chomp;@F=split(//);for$i(0..$#F){$k{$F[$i]}[$i]++}}{for $nt(sort keys(%k)){print"$ntt";for$i(0..$#{$k{$nt}}){$g=$k{$nt}[$i]||0; printf"%.2ft",$g/$.}print"n";}' file

benn · Answer

Here an example in R using Biostrings and letterFrequency (as suggested by Devon Ryan).

library(Biostrings)

data <- read.table("DNA.txt", stringsAsFactors = F)

new <- matrix(nrow = 9, ncol = 15)

for(i in 1:9){
  for(j in 1:15){
    new[i,j] <- substring(data[j,], i, i)
  }
}

countTable <- matrix(nrow = 9, ncol = 4)
for(i in 1:9){
  columnSeq <- DNAStringSet(paste0(new[i,], collapse = ""))
  columnCounts <- letterFrequency(columnSeq, letters = "ACGT", OR = 0)
  countTable[i,] <- columnCounts
}
colnames(countTable) <- c("A", "C", "G", "T")

freqTable <- countTable/15
round(t(freqTable), digit = 2)
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
A 0.13 0.40 0.13    0    0 0.40 0.47 0.07 0.40
C 0.40 0.27 0.00    0    0 0.07 0.07 0.13 0.27
G 0.20 0.20 0.67    1    0 0.47 0.20 0.73 0.20
T 0.27 0.13 0.20    0    1 0.07 0.27 0.07 0.13

Devon Ryan · Answer

Here is one example of how to do this with a bit of python. Alternatively one could create strings of each column and using letterFrequency() from the Biostrings package.

#Make a list of hashes
hl = []
for i in range(9):
    hl.append({'A': 0, 'C': 0, 'G': 0, 'T': 0})

f = open("foo.txt")  # CHANGE ME
nLines = 0
for line in f:
    for idx, c in enumerate(line.strip()):
        hl[idx][c] += 1
    nLines += 1
f.close()

nLines = float(nLines)
for char in ['A', 'C', 'G', 'T']:
    print("{}t{}".format(char, "t".join(["{:0.2f}".format(x[char]/nLines) for x in hl])))

The output of your example is then:

A   0.13    0.40    0.13    0.00    0.00    0.40    0.47    0.07    0.40
C   0.40    0.27    0.00    0.00    0.00    0.07    0.07    0.13    0.27
G   0.20    0.20    0.67    1.00    0.00    0.47    0.20    0.73    0.20
T   0.27    0.13    0.20    0.00    1.00    0.07    0.27    0.07    0.13

calculating nucleotide frequency per column

7 Answers

Add your own answers!

Ask a Question