Stack Overflow Asked on December 7, 2021
I have a data frame with 3 columns and 859 rows. It looks like this:
df1:
MacroNode Prefix Suffix
AAACCGCCAATATCTCGACGAGAAAAGCGAC GCCAACTGGATAACCACGCCCTG GCCAACTGGATAACCACGCCC
ATTTCTGCGAGGTGCAGGGCAATTACATCAT TAGGCCTT AAAACCCTTGGAA
These are the nodes of a graph together with their prefix and suffix edges. Consecutive rows satisfy:
macronode + suffix = prefix of next macronode + that next macronode
I need to find the longest stretch that can be built from the rows in this data frame. I think I first have to combine the rows and then compare them, but I don't understand how to do this. Any ideas are welcome.
Expected Outcome
I am giving a short dataframe here
Toy df:
MacroNode Prefix Suffix
GC T A
CA G C
AC C T
CT A A
As you can see, the first row's macronode followed by its suffix (GC + A) equals the next row's prefix followed by that row's macronode (G + CA).
In my real data frame, however, there is no guarantee that matching rows are adjacent, as they are in this toy example.
The output should then look like this:
The maximum continuous path is :
TGCAGCACCACTACTA which is 16 characters.
First few rows of the original dataframe:
MacroNode Prefix
1. AAACCGCCAATATCTCGACGAGAAAAGCGAC GCCAACTGGATAACCACGCCCTGAGACTCAAGGGCGT
2. AAACTTCTGCCGGAATATAAAGCCGCGCCGG AGCAAAGCGCGCCACTTCACCCTGAGCTT
3. AAAGCATTGTGGCCGGAACCGATGACGCGCC CGGCGTCCCCTGGATGATGGCTTT
4. AACACCACGCTGGAGATGGTTGCTGAACGTG AAATTATTAGAATTACAAGGGATTGCC
5. AACCAGAGCGTTCTGTTACGTGATGTGAACG AAGTTGCGCCGGGTAGGCGTTACTTTGCTG
6. AACGAAGTTCAGCCGCGTGCGAACGGTCAGG GGTATACGCTTCTGCTTCACGAATGTATTGCTGTT
7. AACTCGGGGCTCGGTCAGCACACCACGACCG AAAGAGATCCTGACCAACGATATCTCTGAC
8. AAGCGGTTGAGGAAGGGAAAATCGCGGAAAC ACCGATCCGGGCTGCGCTATCCGGG
9. AAGGCGCTCGTTGATGAACTGGAGCTGGCGC AATTTCGCGTTGCAGTCTGACTCTGCACGTCTT
10. AATATCGACCAGCAATTCGCCTAAAAAGAAG CCGCTGCCCGTGGATCAACCAGT
11. AATCCACACGTTCAGCAACCATCTCCAGCGT ATCCACTGGACGAGCTACGCCGCTT
12. AATCGCGATATTTACACAGACCTAAATAGTC GCAAACACGATACCGATCCGGGCTGCGCTATCCGGGAAGCGGT
13. AATTTCCGGCGCGGCTTTATATTCCGGCAGA ACAGACGCTCGCGAGT
14. ACCACCCAGCACGATGCCAGAAATCAGTGGG AAACAGCGGCTCTCCACTGCCAGAGCAT
15. ACCAGCGTGCCTTCCATCATGTTCATTGCTA GCAGATCCGTGCTAACGCGGTCGTT
16. ACTGTTCCGGCGTGGCATTAGGTGTTGATCG CAGGCATACCGACTT
17. CCCTGGCCGTTTGCTTCGGCTTCGTGCTGGG ACTCTGGGTGTTG
Suffix
1. TAATGCCCTGATGCACGGCACC
2. GTCTCGATATACAGACGCTCGCGAGTAATTT
3. ATCCCCATCGCATTCA
4. TGGATTATCCACTGGACGAGCTACG
5. ATAACGCACAAACGCTGGCAAACCTGA
6. TTGTACGCACGCGCCTCTTCGAGGATACGTTGCG
7. C
8. CCGTTTCGAAAACTATC
9. AGCTGTCTGCCAATAA
10. TCAATCGCGAGGCCGGTTCGTT
11. AGGGATTGCCAACACC
12. CTCAGGGCTTTGTCGAATTCCAT
13. AGTTTAGCAAAGCGCGCCACTTCACCCTGAGCTTCCAGG
14. CCATGCGTGCTGCCAATGTA
15. GCTGGATATTCTGGTTGATGATGGTCATGTTCGCGGCCTGG
16. CAACGCTAAAGGCGATGACTTCAGCCAGTGTCTCCGCGCCCAGCGCCAACATCACCAGA
17. TAGCTTCATGCTGTAATGATCAATCGCGGGGC
I have written the Suffix column separately because it did not fit on the same line.
Since the requirements were clarified a few times, I will post a new solution based on graph theory (the idea actually belongs to @Martin Wettstein, see the comments to the question). Cyclic graphs could cause problems, but that would be a separate question.
The script builds a graph from an adjacency matrix and takes its diameter, i.e. the longest of the shortest paths between any two vertices. For chain-like data such as the dummy data below, this is also the longest path through the graph.
As the provided subset of the real data does not contain continuous sequences, I will use the dummy data from the previous version of the answer.
library(dplyr)
library(igraph)
dat_txt <- "MacroNode Prefix Suffix
GC T A
CA C C
AC C T
CT A A
GC T A
CA G C
AC C T
CT A A"
# Read the data
dat <- read.table(text = dat_txt, header = TRUE)
# Concatenate strings: a row's "cur" must equal the next row's "follow"
res <- dat %>%
  mutate(cur = paste0(MacroNode, Suffix),
         follow = paste0(Prefix, MacroNode),
         full = paste0(Prefix, MacroNode, Suffix))
# Prepare adjacency matrix: TRUE where row r can be followed by row c
edge_mat <- outer(seq_len(nrow(res)), seq_len(nrow(res)), function(r, c) {
  res[r, "cur"] == res[c, "follow"]
})
# Construct graph
res_g <- graph_from_adjacency_matrix(edge_mat)
# Get the path with maximum length
g_diam <- get_diameter(res_g)
# Concatenate longest path
long_seq <- paste(res[g_diam, "full"], collapse = "")
Here is a result:
> long_seq
[1] "TGCAGCACCACTACTA"
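A note on get_diameter(): it returns the longest of the shortest paths, so it can fall short of the true longest path when the graph contains shortcuts (for example edges 1 -> 2 -> 3 plus a direct edge 1 -> 3). If the real data form a directed acyclic graph, one option is a dynamic-programming pass in topological order. The following is only a minimal base-R sketch under that assumption; `longest_path_dag` is a hypothetical helper name, not an igraph function, and it expects a logical adjacency matrix built the same way as `edge_mat` above.

```r
# Hypothetical helper (not part of igraph): longest path through a directed
# acyclic graph given as an adjacency matrix, where adj[r, c] is TRUE
# (or non-zero) for an edge r -> c. Stops with an error if a cycle is found.
longest_path_dag <- function(adj) {
  n <- nrow(adj)
  indeg <- colSums(adj != 0)     # number of incoming edges per vertex
  len <- rep(0L, n)              # edges on the longest path ending at each vertex
  prev <- rep(NA_integer_, n)    # predecessor on that longest path
  queue <- which(indeg == 0)     # Kahn's algorithm: start from source vertices
  done <- 0L
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    done <- done + 1L
    for (w in which(adj[v, ] != 0)) {
      if (len[v] + 1L > len[w]) {
        len[w] <- len[v] + 1L
        prev[w] <- v
      }
      indeg[w] <- indeg[w] - 1
      if (indeg[w] == 0) queue <- c(queue, w)
    }
  }
  if (done < n) stop("graph contains a cycle")
  # Walk back from the vertex where the longest path ends
  path <- which.max(len)
  while (!is.na(prev[path[1]])) path <- c(prev[path[1]], path)
  path
}

# Toy data from the question
toy <- data.frame(
  MacroNode = c("GC", "CA", "AC", "CT"),
  Prefix    = c("T", "G", "C", "A"),
  Suffix    = c("A", "C", "T", "A")
)
cur    <- paste0(toy$MacroNode, toy$Suffix)
follow <- paste0(toy$Prefix, toy$MacroNode)
full   <- paste0(toy$Prefix, toy$MacroNode, toy$Suffix)

adj  <- outer(cur, follow, "==")   # TRUE where row r can be followed by row c
path <- longest_path_dag(adj)
long_seq <- paste(full[path], collapse = "")
long_seq
# [1] "TGCAGCACCACTACTA"
```

On the toy data this gives the same result as the diameter approach; the two would differ only once shortcuts appear in the graph.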
Answered by Istrel on December 7, 2021
Not sure if I understood the question, but to combine all three columns into one you can use this:
df1$newcol <- paste0(df1$MacroNode, df1$Prefix, df1$Suffix)
If you want a space between each:
df1$newcol <- paste(df1$MacroNode, df1$Prefix, df1$Suffix)
Answered by Jeff Henderson on December 7, 2021