Combine two columns of a dataframe and then compare

Question

I have a dataframe of 3 columns and 859 rows. The dataframe is like
df1:
MacroNode                                     Prefix                               Suffix
AAACCGCCAATATCTCGACGAGAAAAGCGAC      GCCAACTGGATAACCACGCCCTG                GCCAACTGGATAACCACGCCC
ATTTCTGCGAGGTGCAGGGCAATTACATCAT       TAGGCCTT                               AAAACCCTTGGAA

These are basically the node and prefix and suffix edges of a graph:
macronode + suffix = prefix of next macronode + that next macronode

I have to see what is the maximum stretch i can achieve by the rows present in this data frame. Thus I think first I have to combine the rows and then compare. But I am not able to understand how to do this. Any ideas are welcome.
Expected Outcome
I am giving a short dataframe here
Toy df:
MacroNode        Prefix          Suffix
 GC                T               A
 CA                G               C
 AC                C               T
 CT                A               A

As you can see here if you take the macronode's characters with the suffix character of the first row (GC + A) it is equal to the next row's prefix character + that next row's macronode's characters (G + CA).
But in my dataframe there is no guarantee that the rows are contiguous like in the toy example here I mentioned.
Then the output shall look like
The maximum continuous path is :
TGCAGCACCACTACTA which is 16 characters.
First few rows of the original dataframe:
    MacroNode                         Prefix                               
 1. AAACCGCCAATATCTCGACGAGAAAAGCGAC   GCCAACTGGATAACCACGCCCTGAGACTCAAGGGCGT
 2. AAACTTCTGCCGGAATATAAAGCCGCGCCGG   AGCAAAGCGCGCCACTTCACCCTGAGCTT
 3. AAAGCATTGTGGCCGGAACCGATGACGCGCC   CGGCGTCCCCTGGATGATGGCTTT
 4. AACACCACGCTGGAGATGGTTGCTGAACGTG   AAATTATTAGAATTACAAGGGATTGCC
 5. AACCAGAGCGTTCTGTTACGTGATGTGAACG   AAGTTGCGCCGGGTAGGCGTTACTTTGCTG
 6. AACGAAGTTCAGCCGCGTGCGAACGGTCAGG   GGTATACGCTTCTGCTTCACGAATGTATTGCTGTT
 7. AACTCGGGGCTCGGTCAGCACACCACGACCG   AAAGAGATCCTGACCAACGATATCTCTGAC
 8. AAGCGGTTGAGGAAGGGAAAATCGCGGAAAC   ACCGATCCGGGCTGCGCTATCCGGG
 9. AAGGCGCTCGTTGATGAACTGGAGCTGGCGC   AATTTCGCGTTGCAGTCTGACTCTGCACGTCTT
10. AATATCGACCAGCAATTCGCCTAAAAAGAAG   CCGCTGCCCGTGGATCAACCAGT
11. AATCCACACGTTCAGCAACCATCTCCAGCGT   ATCCACTGGACGAGCTACGCCGCTT
12. AATCGCGATATTTACACAGACCTAAATAGTC 
                                   
                               GCAAACACGATACCGATCCGGGCTGCGCTATCCGGGAAGCGGT

13. AATTTCCGGCGCGGCTTTATATTCCGGCAGA   ACAGACGCTCGCGAGT
14. ACCACCCAGCACGATGCCAGAAATCAGTGGG   AAACAGCGGCTCTCCACTGCCAGAGCAT
15. ACCAGCGTGCCTTCCATCATGTTCATTGCTA   GCAGATCCGTGCTAACGCGGTCGTT
16. ACTGTTCCGGCGTGGCATTAGGTGTTGATCG   CAGGCATACCGACTT
17. CCCTGGCCGTTTGCTTCGGCTTCGTGCTGGG   ACTCTGGGTGTTG

Suffix
 1. TAATGCCCTGATGCACGGCACC
 2. GTCTCGATATACAGACGCTCGCGAGTAATTT
 3. ATCCCCATCGCATTCA
 4. TGGATTATCCACTGGACGAGCTACG
 5. ATAACGCACAAACGCTGGCAAACCTGA
 6. TTGTACGCACGCGCCTCTTCGAGGATACGTTGCG
 7.  C
 8. CCGTTTCGAAAACTATC
 9. AGCTGTCTGCCAATAA
10. TCAATCGCGAGGCCGGTTCGTT
11. AGGGATTGCCAACACC
12. CTCAGGGCTTTGTCGAATTCCAT
13. AGTTTAGCAAAGCGCGCCACTTCACCCTGAGCTTCCAGG
14. CCATGCGTGCTGCCAATGTA
15. GCTGGATATTCTGGTTGATGATGGTCATGTTCGCGGCCTGG
16. CAACGCTAAAGGCGATGACTTCAGCCAGTGTCTCCGCGCCCAGCGCCAACATCACCAGA
17. TAGCTTCATGCTGTAATGATCAATCGCGGGGC

I have written the suffix column separately as it was not fitting in the same line.

Istrel · Answer

As requirements were clarified a few times, I will post a new solution based on the graph theory (actually, the idea belongs to @Martin Wettstein, see comments to question). Of course, there could be problems in cases of cyclic graphs, still, that will be another question.

The script creates graph from the adjacency matrix and calculates the longest path (diameter) through the graph.

As provided subset of real data does not contain continuous sequences, I will use dummy data from the previous version of the answer.

library(dplyr)
library(igraph)
 
dat_txt <- "MacroNode        Prefix          Suffix
 GC                T               A
 CA                C               C
 AC                C               T
 CT                A               A
 GC                T               A
 CA                G               C
 AC                C               T
 CT                A               A"

# Concat strings
dat <- read.table(text = dat_txt, header = TRUE)
res <- dat %>%
  mutate(cur = paste0(MacroNode, Suffix),
         follow = paste0(Prefix, MacroNode),
         full = paste0(Prefix, MacroNode, Suffix))

# Prepare adjacency matrix
edge_mat <- outer(seq_len(nrow(res)), seq_len(nrow(res)), function(r, c) {
  return(res[r, "cur"] == res[c, "follow"])
})

# Construct graph
res_g <- graph_from_adjacency_matrix(edge_mat)

# Get the path with maximum length
g_diam <- get_diameter(res_g)

# Concatenate longest path
long_seq <- paste(res[g_diam, "full"], collapse = "")

Here is a result:

> long_seq
[1] "TGCAGCACCACTACTA"

Jeff Henderson · Answer

Not sure if I understood but to combine all three cols into one you can use this:
df1$newcol ->  paste0(MacroNode, Prefix, Suffix)

If you want a space between each:
df1$newcol ->  paste0(MacroNode, " ",Prefix, " ",Suffix)

Combine two columns of a dataframe and then compare

2 Answers

Add your own answers!

Ask a Question