Computational Science Asked by Tiago Minuzzi on January 18, 2021
I’m a PhD student in genetics and molecular biology working on an algorithm that uses convolutional neural networks to classify a DNA sequence as either a transposable element (TE) or not a TE. It’s already working more or less the way I’d like it to (of course, I’m always trying to improve it).
The input is a FASTA file containing multiple DNA sequences. The algorithm analyses each sequence and returns whether or not it is a TE. But here is the thing: the whole sequence is not necessarily a TE; in many cases, just a fragment (like a sub-string of the string) is a TE.
I’d like to know if there is a way to map the coordinates and/or return the fragment representing the TE. To me it seems kind of tricky because of all the sequence pre-processing (one-hot encoding, flattening, etc.), and I don’t know how to get from the arrays of zeros and ones that the original sequence became back to what I want. Although I know some Python and I’m studying machine learning and deep learning to understand how they work, my area is biological sciences, not computer science or anything related.
Here is an example of what I described above.
Let’s say I have these three sequences. The sub-string in lower case is the TE (this is just for the sake of the example; the real data will not be marked like this).
>NAD4
TAATATTAAGATaggattgggattgtatgaagggttaaaattaatatttctataatattaatagaaaaaaagttgttaagatttttatttacgaagccatgttgagttcttCCAAAAA
>NAD4-V
CTAGTTAAAAGTAAATGTTaagataaggattgggattgtatgaagggttaaaattaatatttctataatattaatagaaaaaaagttgttAAGATTTTTATTTACGAAGCCATGTTGAG
>STL-M
TCGAAGAAGGGGTCATTAAATTTACTTTTGCTTTTTATACTATATTAGATCTTAAATCGTTTATATGTTTTTTTTAAAAAAACTATAAAGTTACCCACAAATAGAAAATTTGTTGTGCT
I’d like to have something like the following as the output:
ID Classification Coordinates Sequence
NAD4 TE 13:112 aggattgggattgtatgaagggttaaaattaatatttctataatattaatagaaaaaaagttgttaagatttttatttacgaagccatgttgagttctt
NAD4-V TE 20:91 aagataaggattgggattgtatgaagggttaaaattaatatttctataatattaatagaaaaaaagttgtt
STL-M NT NaN NaN
Am I asking too much from the neural network? Will I have to use some tool or custom script after the prediction to figure out the sequences and/or coordinates?
I'll start with a disclaimer: my PhD is in the fast computation of eigenvalues, and my specialty is not machine learning at all. This is just some stuff I remember from some master's-level courses. I have two ideas that might work.
Idea 1
Traditional convolutional neural nets are very good at classifying: for example, "does this image contain a dog" or, in your case, "does this sequence contain a TE". The reason for this is translational invariance. That's a fancy way of saying that these nets, because of their convolutional and pooling layers, tend not to care where something is in an image or sequence, only what it is. This makes them much better at generalizing.
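To make the invariance point concrete, here is a minimal pure-Python sketch. It is not a trained network: `motif_scores` is a hypothetical stand-in for a single convolutional filter, and the motif and the two toy sequences are made up for illustration. It shows that global max pooling gives the same answer wherever the motif sits, while the feature map before pooling still knows the position.

```python
def motif_scores(seq, motif):
    """Slide `motif` along `seq` and count matching bases at each offset.

    This mimics one convolutional filter on one-hot encoded DNA: the dot
    product of two one-hot vectors is 1 on a match and 0 otherwise.
    """
    seq, motif = seq.upper(), motif.upper()
    width = len(motif)
    return [
        sum(a == b for a, b in zip(seq[i:i + width], motif))
        for i in range(len(seq) - width + 1)
    ]


motif = "AGGATTGGG"  # hypothetical filter, not learned from data
early = "TTTT" + motif + "CCCCCCCCCCCC"  # motif near the start
late = "CCCCCCCCCCCC" + motif + "TTTT"   # same motif near the end

scores_early = motif_scores(early, motif)
scores_late = motif_scores(late, motif)

# Global max pooling: the same value for both sequences, so position is lost.
pooled_early = max(scores_early)
pooled_late = max(scores_late)

# The feature map before pooling still records where the best match was.
pos_early = scores_early.index(pooled_early)  # 0-based offset
pos_late = scores_late.index(pooled_late)

print(pooled_early, pooled_late, pos_early, pos_late)
```

The classifier only ever sees the pooled value, which is why it can say "TE" without being able to say where.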
When people started to use convolutional neural nets to find out where something is in an image, not just whether it is present, they had to change the architecture of the neural net. Those neural nets use branches of fully connected layers and convolutional layers and reconnect them later to recover the information about the location. You could do something similar to recover the location of the TE sequence.
With some luck, you might be able to adapt the object localization networks that other people have designed to your use case.
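Whichever localization architecture you pick, it will need position-level training targets, and the one-hot encoding does not get in the way of that. The sketch below is only an illustration under assumptions: the sequence is a shortened, made-up variant of the NAD4 example, `one_hot` and `per_base_labels` are hypothetical helper names, and I am reading the "start:end" coordinates in the question as a 1-based start with an exclusive end.

```python
BASES = "ACGT"


def one_hot(seq):
    """One row of four 0/1 values per base; row i describes base i."""
    return [[int(base == b) for b in BASES] for base in seq.upper()]


def per_base_labels(seq_len, start, end):
    """Training target for localization: 1 inside the TE, 0 outside.

    Assumes 'start:end' means a 1-based start with an exclusive end.
    """
    return [int(start - 1 <= i < end - 1) for i in range(seq_len)]


# Shortened, made-up variant of the NAD4 example; lower case marks the TE.
seq = "TAATATTAAGATaggattgggaCCAAAAA"
encoded = one_hot(seq)
labels = per_base_labels(len(seq), 13, 23)

# Flattening does not destroy positions: flat index k belongs to base k // 4.
flat = [value for row in encoded for value in row]
base_of_flat_index = [k // len(BASES) for k in range(len(flat))]

# Reading the labels back against the original string recovers the fragment.
te_from_labels = "".join(base for base, label in zip(seq, labels) if label)
print(te_from_labels)
```

Because row i of the encoding always corresponds to base i of the original string, any per-position output of the network can be mapped straight back to coordinates.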
Idea 2
When dealing with sequences, recurrent neural networks (RNNs) usually work quite well. Instead of dealing with the entire sequence at once, they read it token by token and as such can provide information about the location. Natural language processing researchers have had a lot of success with these networks, but in my experience they are quite sensitive to how they are trained, so YMMV.
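If a network gives you one score per base (whether from an RNN or from some other per-position model), a small post-processing script can turn those scores into the table from the question. This is a sketch under assumptions: the scores are made up rather than produced by a model, the 0.5 threshold is arbitrary, `intervals_from_scores` and `report` are hypothetical helper names, and coordinates again use a 1-based start with an exclusive end.

```python
def intervals_from_scores(scores, threshold=0.5):
    """Merge consecutive positions scoring at or above `threshold`.

    Returns (start, end) pairs with a 1-based start and an exclusive end.
    """
    intervals = []
    run_start = None
    for i, score in enumerate(scores):
        if score >= threshold and run_start is None:
            run_start = i  # a new run of TE positions begins here
        elif score < threshold and run_start is not None:
            intervals.append((run_start + 1, i + 1))
            run_start = None
    if run_start is not None:  # the run extends to the end of the sequence
        intervals.append((run_start + 1, len(scores) + 1))
    return intervals


def report(seq_id, seq, scores, threshold=0.5):
    """Build (ID, Classification, Coordinates, Sequence) rows for one record."""
    rows = [
        (seq_id, "TE", f"{start}:{end}", seq[start - 1:end - 1])
        for start, end in intervals_from_scores(scores, threshold)
    ]
    return rows or [(seq_id, "NT", "NaN", "NaN")]


# Shortened, made-up variant of the NAD4 example with made-up per-base scores.
seq = "TAATATTAAGATaggattgggaCCAAAAA"
scores = [0.1] * 12 + [0.9] * 10 + [0.2] * 7

rows = report("NAD4-toy", seq, scores)
no_te_rows = report("STL-toy", seq, [0.1] * len(seq))
print(rows)
print(no_te_rows)
```

In practice you would probably also want to drop very short runs or bridge small gaps between runs, but that depends on how noisy the per-base scores turn out to be.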
Good luck
Correct answer by Thijs Steel on January 18, 2021