Data Science Asked by George Liu on April 4, 2021
I’m doing text classification on text messages generated by consumers and just realized that, even though most of the replies are in English, some are in French. I’ve used a Keras word embedding, Conv1D and max pooling to learn the structure in the text, and I didn’t use any other text preprocessing techniques such as stop-word removal.
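For reference, here is a minimal sketch of the kind of model being described; vocabulary size, embedding dimension, filter count, and number of classes are placeholders, not values from the question:

```python
from tensorflow.keras import layers, models

# Assumed placeholders -- the question does not give these values.
VOCAB_SIZE = 20000
EMBED_DIM = 128
NUM_CLASSES = 5

model = models.Sequential([
    # One embedding table over the whole vocabulary, English and French tokens alike
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    # Conv1D + max pooling to pick up local n-gram structure
    layers.Conv1D(64, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```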
In this case, I think it should be fine to use word embeddings on both languages, since word embeddings learn the meaning of individual words regardless of language… Is this reasonable? Or do I need to separate the languages and build a different model for each?
Logically, you are correct. A word embedding is simply a representation learned per token, with each token’s features derived from the words that appear near it in a sentence. So if you have sufficient raw data (a mix of both languages), a single model should be fine, though the results will tell you more :).
However, it would be interesting to see how such models behave when the data mixes left-to-right (LTR) and right-to-left (RTL) languages.
Correct answer by vipin bansal on April 4, 2021
It depends on the data distribution. Some consumers may mix languages within a single message, and if only a few do this, your word embedding might not learn anything useful from those examples. It can be helpful to add a feature that indicates (or approximately indicates) whether a conversation is mixed, purely English, or purely French, as in the sketch below.
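A minimal sketch of such a language-indicator feature, assuming the third-party langdetect package is available; the probability threshold and labels are illustrative:

```python
from langdetect import detect_langs  # assumed dependency: pip install langdetect

def language_flag(text, threshold=0.9):
    """Return 'en', 'fr', 'mixed', or 'unknown' as a rough indicator feature.

    threshold is an illustrative cutoff, not a tuned value.
    """
    try:
        guesses = detect_langs(text)  # e.g. [en:0.93, fr:0.07], sorted by probability
    except Exception:
        return "unknown"
    top = guesses[0]
    if top.lang in ("en", "fr") and top.prob >= threshold:
        return top.lang
    return "mixed"

# Example: a message that blends French and English
print(language_flag("Bonjour, can you help me with my account?"))
```

This flag could be appended as an extra input feature alongside the text, or simply used to inspect how many conversations are actually mixed before deciding whether separate models are worth the effort.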
Answered by bonez001 on April 4, 2021