Data Science Asked by StuckInPhDNoMore on February 28, 2021
I have some experience with images and have played around with image classification using CNNs, but I have limited knowledge when it comes to text data.
The input that I currently want to classify is written as:
hjkhghkgfghjkhghkgfghfefdefdcdefghjkjh-hjhgfe
fdcd-dd-fdc-dad-ad-dfe-cde-dggf-ghd-gg-bcd
hjkhghkgfghjkhghkgfghfefdefdcdefghjkjh-gh-gfed
dh-hg-gf-gh-dh-hg-gf-gh-hkhg-kh-hg-gf-gh-hkhg-kh-hg-gf-ghh-hgfg-dfd-dc-fgf-gh
I have over 2000 rows of this data that need to be classified. I know that for regular text data, RNNs with LSTM cells are known to be very effective. With RNN+LSTM, good results can be achieved by pre-processing the data using the usual approaches such as stemming, lemmatization, stop word filtering, tokenization, etc. But the same cannot be applied to the text data I have.
Would RNN and LSTM still work on my data? If not, which networks do you guys suggest I explore for such a task?
You need character embeddings. I assume you are already familiar with word2vec. Its goal is to make a model "learn" the relative meaning of words by placing them in a high-dimensional space.
The same can be done with single characters, instead of whole words. The preprocessing steps you need will be a little bit different, but the embedding technique is the same. In that way, you can generate representations of characters, feed their sequences into some RNN model, and perform the final classification task.
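As a rough illustration of the character-level preprocessing, here is a minimal sketch in plain Python. The two rows are copied from the question; the helper names build_vocab and encode are made up for this example, and mapping unseen characters to the padding id 0 is just one simple assumption.

```python
# Two sample rows copied from the question (labels omitted).
rows = [
    "hjkhghkgfghjkhghkgfghfefdefdcdefghjkjh-hjhgfe",
    "fdcd-dd-fdc-dad-ad-dfe-cde-dggf-ghd-gg-bcd",
]

def build_vocab(texts):
    """Map each distinct character to an integer id, keeping 0 free for padding."""
    chars = sorted(set("".join(texts)))
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab, max_len):
    """Turn a string into a list of ids of length max_len.

    Longer strings are truncated, shorter ones are zero-padded on the right.
    Characters missing from the vocabulary fall back to 0 (an assumption).
    """
    ids = [vocab.get(ch, 0) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

vocab = build_vocab(rows)
max_len = max(len(r) for r in rows)
encoded = [encode(r, vocab, max_len) for r in rows]
```

The resulting integer sequences all have the same length, which is the shape an embedding layer expects as input. Note that the hyphen is treated as just another character here; whether it should instead be used as a separator depends on what it means in your data.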
Therefore, RNNs are perfectly suitable for this task. If you are working with tensorflow.keras, you can simply tokenize the characters and feed them through an Embedding() layer, which will learn the character representations for you. An alternative to recurrent cells is 1D convolutional layers, which can do the same job. That's up to your preference.
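To make this concrete, here is a minimal sketch of both variants in tensorflow.keras. The values of vocab_size, max_len and num_classes, the layer sizes, and the build_model helper are all assumptions for illustration; the question does not state how many classes there are.

```python
import numpy as np
from tensorflow.keras import layers, models

vocab_size = 20   # assumed: number of distinct characters + 1 for the padding id 0
max_len = 100     # assumed: every sequence is padded or truncated to this length
num_classes = 3   # assumed: not given in the question

def build_model(recurrent=True):
    """Embedding followed by either an LSTM or a Conv1D feature extractor."""
    model = models.Sequential()
    # Learns a 16-dimensional vector for each character id.
    model.add(layers.Embedding(input_dim=vocab_size, output_dim=16))
    if recurrent:
        model.add(layers.LSTM(32))
    else:
        model.add(layers.Conv1D(32, kernel_size=5, activation="relu"))
        model.add(layers.GlobalMaxPooling1D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model

# Random integer sequences standing in for encoded character rows.
x = np.random.randint(1, vocab_size, size=(4, max_len))
lstm_out = build_model(recurrent=True).predict(x, verbose=0)
conv_out = build_model(recurrent=False).predict(x, verbose=0)
```

Each model returns one probability vector per row. With only about 2000 rows, it is probably worth keeping the model small and comparing both variants on a held-out validation set.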
Correct answer by Leevo on February 28, 2021