Cross Validated Asked by bandit_king28 on December 14, 2020
I have a long list of DNA strings (of equal length) made of 4 letters (A, T, G, C). I want to do a binary classification on the strings. I have two basic questions:
The dataset looks like the following:
String            Class
ATTGCCCGCGCGCCG   1
AGGCGCGCAGCAGCA   2
GCGCGCAGCAGGACA   1
I have tried dividing each string into overlapping substrings of length 3, 4, and 5 and then using TF-IDF or CountVectorizer to find their vector representation. Finally, I trained a classifier on these vectors and reported the results. But the accuracy won't go above 63%.
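For concreteness, here is a minimal sketch of the substring step described above, written with the standard library only. The helper names `kmers` and `kmer_counts` are made up for illustration and are not from any library; the three sequences are the example rows from the table.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def kmer_counts(seq, ks=(3, 4, 5)):
    """Count every overlapping substring of each length in ks."""
    counts = Counter()
    for k in ks:
        counts.update(kmers(seq, k))
    return counts

sequences = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]

# One Counter per sequence, then a fixed sorted vocabulary so that
# every sequence maps to a count vector of the same length.
counts = [kmer_counts(s) for s in sequences]
vocab = sorted(set().union(*counts))
vectors = [[c[km] for km in vocab] for c in counts]
```

Each row of `vectors` can then be passed to a classifier. In scikit-learn, `CountVectorizer` or `TfidfVectorizer` with `analyzer="char"` and `ngram_range=(3, 5)` should produce comparable features directly from the raw strings.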
For sequence data, an LSTM is a common default model. It can model long sequences and has much greater representational power than linear models. Take a look at PyTorch's tutorial if you're new to it.
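As a rough sketch of what that could look like in PyTorch (not a tuned model): each base is mapped to an integer, embedded, run through a single-layer LSTM, and the final hidden state is fed to a linear layer that outputs one logit. The class name `DNAClassifier`, the layer sizes, the learning rate, and the recoding of classes 1 and 2 as 0 and 1 are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(seqs):
    """Map equal-length DNA strings to a LongTensor of shape (batch, length)."""
    return torch.tensor([[BASES[b] for b in s] for s in seqs], dtype=torch.long)

class DNAClassifier(nn.Module):
    """Embedding -> LSTM -> linear layer producing one logit per sequence."""

    def __init__(self, embed_dim=8, hidden_dim=32):
        super().__init__()
        self.embed = nn.Embedding(len(BASES), embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # h_n has shape (num_layers, batch, hidden_dim); use the last layer.
        _, (h_n, _) = self.lstm(self.embed(x))
        return self.out(h_n[-1]).squeeze(-1)  # shape: (batch,)

torch.manual_seed(0)
seqs = ["ATTGCCCGCGCGCCG", "AGGCGCGCAGCAGCA", "GCGCGCAGCAGGACA"]
labels = torch.tensor([0.0, 1.0, 0.0])  # assumed recoding: class 1 -> 0, class 2 -> 1

model = DNAClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

x = encode(seqs)
for _ in range(20):  # a few full-batch steps, just to show the training loop
    optimizer.zero_grad()
    loss = loss_fn(model(x), labels)
    loss.backward()
    optimizer.step()
```

On a real dataset you would train in mini-batches and measure accuracy on a held-out split; three sequences are only enough to show the shapes.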
If I have a large enough dataset, I usually don't bother to remove the duplicates.
Answered by yiping on December 14, 2020